[ENH] faster affinity and dissimilarity matrix computation #64

Merged: 34 commits merged on May 8, 2023

@sampan501 sampan501 commented Apr 14, 2023


Changes proposed in this pull request:

  • Replace the nested for loop in the unsupervised forest's affinity calculation with np.equal.outer and a single for loop. This should be faster for large n_estimators.
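
As a rough illustration of the change described above, here is a minimal sketch that compares a pairwise nested loop with a per-tree np.equal.outer accumulation. The PR's original loop is not shown in this conversation, so both function names and the (n_samples, n_estimators) layout of the leaf-index array are assumptions made for the example.

```python
import numpy as np


def affinity_nested_loops(leaves):
    """Hypothetical baseline: compare every pair of samples in Python loops."""
    n_samples, _ = leaves.shape
    aff = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        for j in range(n_samples):
            # Fraction of trees in which samples i and j share a leaf.
            aff[i, j] = np.mean(leaves[i] == leaves[j])
    return aff


def affinity_outer(leaves):
    """Sketch of the faster variant: one loop over trees, vectorized over pairs."""
    n_samples, n_estimators = leaves.shape
    aff = np.zeros((n_samples, n_samples))
    for t in range(n_estimators):
        # Boolean (n_samples, n_samples) matrix: True where leaf ids match.
        aff += np.equal.outer(leaves[:, t], leaves[:, t])
    return aff / n_estimators


# Random leaf ids standing in for the output of a fitted forest.
rng = np.random.default_rng(0)
leaves = rng.integers(0, 4, size=(20, 7))
assert np.allclose(affinity_nested_loops(leaves), affinity_outer(leaves))
```

Both versions produce the same matrix; the second does one vectorized comparison per tree instead of one Python-level comparison per pair of samples.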

Before submitting

  • I've read and followed all steps in the Making a pull request
    section of the CONTRIBUTING docs.
  • I've updated or added any relevant docstrings following the syntax described in the
    Writing docstrings section of the CONTRIBUTING docs.
  • If this PR fixes a bug, I've added a test that will fail without my fix.
  • If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

  • All GitHub Actions jobs for my pull request have passed.

@sampan501 sampan501 requested a review from adam2392 April 14, 2023 13:11
codecov bot commented Apr 14, 2023

Codecov Report

Patch coverage: 98.52% and project coverage change: +0.29 🎉

Comparison is base (b70935b) 92.14% compared to head (0b6d9a4) 92.44%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #64      +/-   ##
==========================================
+ Coverage   92.14%   92.44%   +0.29%     
==========================================
  Files          10       12       +2     
  Lines         891      913      +22     
==========================================
+ Hits          821      844      +23     
+ Misses         70       69       -1     
Impacted Files                            Coverage Δ
sktree/ensemble/_unsupervised_forest.py   70.40% <90.90%> (-1.66%) ⬇️
sktree/ensemble/_supervised_forest.py     100.00% <100.00%> (ø)
sktree/tests/test_neighbors.py            100.00% <100.00%> (ø)
sktree/tests/test_unsupervised_forest.py  100.00% <100.00%> (ø)
sktree/tree/_classes.py                   88.03% <100.00%> (-0.45%) ⬇️
sktree/tree/_neighbors.py                 100.00% <100.00%> (ø)
sktree/tree/tests/test_tree.py            100.00% <100.00%> (ø)


@sampan501
Member Author

@adam2392 Thoughts?

Also, I believe the style checks are failing for files I did not edit

Collaborator

@adam2392 adam2392 left a comment

Per our discussion, maybe we should move this functionality to an ensemble/neighbors.py file so it can be used with supervised trees as well?

Any chance you have a short, good unit test for this function? I realized we don't have one right now.

@adam2392
Collaborator

> @adam2392 Thoughts?
>
> Also, I believe the style checks are failing for files I did not edit

Can you just run black on the whole repo?

@sampan501
Member Author

Yep I can do that

@sampan501 sampan501 changed the title from "[ENH] faster affinity matrix computation" to "[ENH] faster affinity and dissimilarity matrix computation" Apr 14, 2023
@sampan501
Member Author

@adam2392 For the supervised versions, the options are to edit ForestClassifier in scikit-learn-fork, or just add the similarity matrix computation to each respective forest type (and the same when I write the dissimilarity matrix functions). Which do you prefer?

@adam2392
Collaborator

> @adam2392 For the supervised versions, the options are to edit ForestClassifier in scikit-learn-fork, or just add the similarity matrix computation to each respective forest type (and the same when I write the dissimilarity matrix functions). Which do you prefer?

So, based on our email, we want to add this functionality to all classes, but also we might want to keep the fork lightweight unless necessary.

I would just add a Mixin class that has this functionality and add the Mixin class to all the trees/forests applicable in scikit-tree.

We should/could also expose a functional interface, so just

def compute_similarity_matrix_forest(...)


def compute_dissimilarity_matrix_forest(...)

These functions are then called from the tree/forest class methods. I think the functional approach makes more sense because it avoids adding to the tree classes' API.
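
To make the two options above concrete, here is a minimal sketch of a standalone function plus a thin mixin that delegates to it. The names compute_forest_similarity_matrix, SimilarityMatrixMixin, and ToyForest are hypothetical and are not taken from scikit-tree; the only assumption about the estimator is that it exposes an apply(X) method returning leaf indices of shape (n_samples, n_estimators), as scikit-learn forests do.

```python
import numpy as np


def compute_forest_similarity_matrix(forest, X):
    """Hypothetical functional interface: works with any object exposing apply(X)."""
    leaves = np.asarray(forest.apply(X))  # assumed shape: (n_samples, n_estimators)
    n_samples, n_estimators = leaves.shape
    sim = np.zeros((n_samples, n_samples))
    for t in range(n_estimators):
        sim += np.equal.outer(leaves[:, t], leaves[:, t])
    return sim / n_estimators


class SimilarityMatrixMixin:
    """Hypothetical mixin: a one-line method that delegates to the function."""

    def compute_similarity_matrix(self, X):
        return compute_forest_similarity_matrix(self, X)


class ToyForest(SimilarityMatrixMixin):
    """Stand-in for a fitted forest: each 'tree' thresholds the first feature."""

    def __init__(self, thresholds):
        self.thresholds = np.asarray(thresholds, dtype=float)

    def apply(self, X):
        X = np.asarray(X, dtype=float)
        # Leaf id is 0 or 1 per tree, depending on the side of the threshold.
        return (X[:, [0]] > self.thresholds[None, :]).astype(int)


X = np.array([[0.1], [0.4], [0.9]])
forest = ToyForest(thresholds=[0.25, 0.5, 0.75])
sim = forest.compute_similarity_matrix(X)
```

With this split, the class API grows by at most one delegating method, and the function remains usable on estimators that do not inherit from the mixin.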

Collaborator

@adam2392 adam2392 left a comment

Do you mind adding the dissimilarity matrix while you're at it and the reference that Jovo mentioned in the email thread?

https://arxiv.org/pdf/1812.00029.pdf

Otherwise, LGTM

@adam2392
Collaborator

To fix the CI issues:

  • Run poetry run poe format, which applies all formatting checks. You can run poetry run poe lint to check for any flake8 and pydocstyle issues.
  • Add sktree/ensemble/_neighbors.py to the corresponding meson.build file inside ensemble/ to register the additional Python file.

sampan501 and others added 4 commits May 4, 2023 14:40
Co-authored-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392
Collaborator

adam2392 commented May 5, 2023

I fixed the numpydoc errors, sphinx errors and also refactored this.

I think storing a dissimilarity matrix is not needed and only adds RAM cost for no gain. Anyone can take 1 - forest.similarity_matrix_ to get the same result.
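
A small sketch of the trade-off described above, using random leaf indices as a stand-in for a fitted forest's output. Only the 1 - similarity relation comes from this thread; the array sizes and variable names are illustrative assumptions.

```python
import numpy as np

# Stand-in for forest.apply(X): 100 samples, 10 trees, 5 possible leaf ids.
rng = np.random.default_rng(0)
leaves = rng.integers(0, 5, size=(100, 10))

# Similarity: fraction of trees in which two samples fall in the same leaf.
similarity = np.zeros((100, 100))
for t in range(leaves.shape[1]):
    similarity += np.equal.outer(leaves[:, t], leaves[:, t])
similarity /= leaves.shape[1]

# Dissimilarity is derived on demand instead of being stored on the estimator.
dissimilarity = 1.0 - similarity

# Each stored float64 matrix costs n_samples**2 * 8 bytes.
print(similarity.nbytes)  # 80000
```

Storing both matrices would double this footprint, which grows quadratically with the number of samples.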

I also factored out the similarity computation as a function, so it can be easily used with any BaseForest subclass that implements apply.

f4f977e

Signed-off-by: Adam Li <adam2392@gmail.com>
Collaborator

@adam2392 adam2392 left a comment

All CIs fixed, and I applied some fixes. LGTM

@sampan501
Member Author

Looks good to me!

adam2392 added 11 commits May 6, 2023 18:32
Signed-off-by: Adam Li <adam2392@gmail.com>
@adam2392 adam2392 merged commit aa7e174 into main May 8, 2023
19 checks passed
@adam2392 adam2392 deleted the better-affinity-matrix branch May 8, 2023 15:41
@adam2392
Collaborator

adam2392 commented May 8, 2023

@sampan501 I removed the automatic computation of the similarity matrix, as it caused unnecessary RAM/CPU usage during the docs build. I think we can enable it in the future if necessary, but one can just call compute_similarity_matrix(X).

@adam2392
Collaborator

@sampan501 I noticed we never added dissimilarity_matrix_. What is the way to compute this that's in line with the affinity matrix?

Moreover, are we missing a division by affinity_matrix.max() somewhere?

@sampan501
Member Author

I think we talked about not storing both in memory, since dissim is just 1 - sim.

Also, since the max is 1, there is no need to divide by the max.
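
The point about the maximum can be checked with a short sketch. It assumes the similarity is the mean over trees of leaf co-occurrence, computed for a set of samples against itself, so every sample matches itself in every tree and the diagonal is exactly 1. The helper name similarity_from_leaves is hypothetical.

```python
import numpy as np


def similarity_from_leaves(leaves):
    """Hypothetical helper: mean leaf co-occurrence over trees."""
    n_samples, n_estimators = leaves.shape
    sim = np.zeros((n_samples, n_samples))
    for t in range(n_estimators):
        sim += np.equal.outer(leaves[:, t], leaves[:, t])
    return sim / n_estimators


# Random leaf ids standing in for the output of a fitted forest.
rng = np.random.default_rng(42)
leaves = rng.integers(0, 3, size=(50, 8))
sim = similarity_from_leaves(leaves)

# Entries are fractions of trees, so they lie in [0, 1]; the diagonal is 1.
assert sim.min() >= 0.0
assert sim.max() == 1.0

# Dividing by the maximum therefore leaves the matrix unchanged.
normalized = sim / sim.max()
assert np.array_equal(normalized, sim)
```

If the matrix were computed between two different sample sets, the maximum could fall below 1, so the argument applies only to the self-similarity case sketched here.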
