[ENH] faster affinity and dissimilarity matrix computation #64
Conversation
Codecov Report — Patch coverage:
Additional details and impacted files:

@@           Coverage Diff            @@
##             main      #64      +/- ##
==========================================
+ Coverage   92.14%   92.44%   +0.29%
==========================================
  Files          10       12       +2
  Lines         891      913      +22
==========================================
+ Hits          821      844      +23
+ Misses         70       69       -1

☔ View full report in Codecov by Sentry.
@adam2392 Thoughts? Also, I believe the style checks are failing for files I did not edit.
Per our discussion, maybe we should move this functionality to an ensemble/neighbors.py file so it can be used with supervised trees as well?
Any chance you have a short, good unit test that works for this function? I realized we don't have one right now.
Can you just run black on the whole repo?
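A short unit test for such a function could check a few invariants of a forest-induced similarity matrix: symmetry, a unit diagonal, and entries in [0, 1]. This is only a sketch; the helper name `similarity_from_leaves` and the assumption that it consumes the `(n_samples, n_estimators)` leaf-index array from `forest.apply(X)` are illustrative, not scikit-tree's actual API:

```python
import numpy as np


def similarity_from_leaves(leaves):
    """Fraction of trees in which each pair of samples shares a leaf.

    ``leaves`` has shape (n_samples, n_estimators), as returned by
    ``forest.apply(X)``. Hypothetical helper, not scikit-tree's API.
    """
    n_samples, n_estimators = leaves.shape
    sim = np.zeros((n_samples, n_samples))
    for t in range(n_estimators):
        # Samples that land in the same leaf of tree t co-occur
        sim += np.equal.outer(leaves[:, t], leaves[:, t])
    return sim / n_estimators


def test_similarity_matrix_invariants():
    rng = np.random.RandomState(0)
    # Simulated forest.apply(X) output: one leaf index per (sample, tree)
    leaves = rng.randint(0, 4, size=(10, 5))
    S = similarity_from_leaves(leaves)
    assert np.allclose(S, S.T)           # symmetric
    assert np.allclose(np.diag(S), 1.0)  # a sample always shares its own leaf
    assert S.min() >= 0.0 and S.max() <= 1.0


test_similarity_matrix_invariants()
```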
Yep, I can do that.
@adam2392 For the supervised versions, the options are to edit
So, based on our email, we want to add this functionality to all classes, but we also might want to keep the fork lightweight unless necessary. I would just add a Mixin class that has this functionality and add the Mixin class to all the applicable trees/forests in scikit-tree. We should/could also expose a functional interface, so just
Then this is called in the tree/forest class methods. I think the functional approach makes more sense because it leans towards not adding to the tree classes' API.
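The Mixin-plus-function split described above could look roughly like the following. All names here (`forest_similarity_matrix`, `SimilarityMixin`, `compute_similarity_matrix`) are hypothetical; the only assumption taken from the ecosystem is that fitted forests expose `apply(X)` returning per-tree leaf indices:

```python
import numpy as np


def forest_similarity_matrix(leaves):
    """Functional interface: similarity as the fraction of trees in which
    two samples fall into the same leaf.

    ``leaves`` is the (n_samples, n_estimators) array from ``forest.apply(X)``.
    """
    n_samples, n_estimators = leaves.shape
    sim = np.zeros((n_samples, n_samples))
    for t in range(n_estimators):
        # Boolean co-occurrence matrix for tree t, accumulated over trees
        sim += np.equal.outer(leaves[:, t], leaves[:, t])
    return sim / n_estimators


class SimilarityMixin:
    """Thin mixin for trees/forests: delegates to the function so the
    estimator API gains only one small method."""

    def compute_similarity_matrix(self, X):
        return forest_similarity_matrix(self.apply(X))
```

A forest class would then pick this up via multiple inheritance, e.g. `class MyForest(SimilarityMixin, BaseForest): ...`, which keeps the heavy lifting in the free function as suggested.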
Do you mind adding the dissimilarity matrix while you're at it and the reference that Jovo mentioned in the email thread?
https://arxiv.org/pdf/1812.00029.pdf
Otherwise, LGTM.
To fix the CI issues:
Co-authored-by: Adam Li <adam2392@gmail.com>
Signed-off-by: Adam Li <adam2392@gmail.com>
I fixed the numpydoc errors and sphinx errors, and also refactored this. I think storing a dissimilarity matrix is not needed and just adds RAM cost for no gain; someone can just take the complement (1 − similarity). I also factored out the similarity computation as a function, so it can be easily used with any BaseForest method that implements it.
All CIs fixed, and I applied some fixes. LGTM
Looks good to me!
@sampan501 I removed the automatic computation of the similarity matrix, as this caused unnecessary RAM/CPU usage when building the docs. I think we can enable it in the future if necessary, but one can just call it on demand.
@sampan501 I noticed we never added it. Moreover, are we missing a division by the max?
I think we talked about not storing both in memory, since dissim is just 1 − sim. Also, since the max is 1, there is no need to divide by the max.
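As a tiny numeric check of that point (the matrix values below are made up for illustration): when similarities are co-occurrence fractions over trees, the largest possible entry is 1 (a pair sharing a leaf in every tree), so `1 - S` is already a dissimilarity in [0, 1] and no normalization by the max is needed:

```python
import numpy as np

# Hypothetical similarity matrix: entries are fractions of trees, so max is 1
S = np.array([[1.0, 0.4, 0.0],
              [0.4, 1.0, 0.7],
              [0.0, 0.7, 1.0]])

# Dissimilarity is simply the complement; already bounded in [0, 1]
D = 1.0 - S
```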
None

Changes proposed in this pull request:
- n_estimators

Before submitting
- section of the CONTRIBUTING docs.
- Writing docstrings section of the CONTRIBUTING docs.

After submitting