[ENH] faster affinity and dissimilarity matrix computation #64

sampan501 · 2023-04-14T13:07:41Z

Changes proposed in this pull request:

Replace the nested for loop in the unsupervised forest's affinity calculation with np.equal.outer and a single for loop. This should be faster for large n_estimators.
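For readers following along, here is a minimal sketch of the idea, not the code from this PR. The original nested loop is not shown in the thread, so `affinity_nested` is an assumed form of it, and the `leaves` array is made-up data shaped like the output of a forest's `apply(X)` (one row per sample, one column per tree).

```python
import numpy as np


def affinity_nested(leaves):
    """Assumed baseline: loop over every pair of samples and count shared leaves."""
    n_samples, _ = leaves.shape
    aff = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        for j in range(n_samples):
            aff[i, j] = np.sum(leaves[i] == leaves[j])
    return aff


def affinity_outer(leaves):
    """Single loop over trees; np.equal.outer handles all sample pairs at once."""
    n_samples, n_trees = leaves.shape
    aff = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        # Boolean (n_samples, n_samples) matrix: True where two samples
        # fall in the same leaf of tree t.
        aff += np.equal.outer(leaves[:, t], leaves[:, t])
    return aff


# Hypothetical leaf indices: 4 samples, 3 trees.
leaves = np.array([[1, 2, 2], [1, 3, 2], [4, 3, 5], [4, 2, 5]])
assert np.array_equal(affinity_nested(leaves), affinity_outer(leaves))
```

Both versions return raw co-occurrence counts; the Python-level loop in the second one runs over trees only, and the pairwise comparison is done inside NumPy.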

Before submitting

I've read and followed all steps in the Making a pull request section of the CONTRIBUTING docs.
I've updated or added any relevant docstrings following the syntax described in the Writing docstrings section of the CONTRIBUTING docs.
If this PR fixes a bug, I've added a test that will fail without my fix.
If this PR adds a new feature, I've added tests that sufficiently cover my new functionality.

After submitting

All GitHub Actions jobs for my pull request have passed.

codecov · 2023-04-14T13:25:48Z

Codecov Report

Patch coverage: 98.52% and project coverage change: +0.29 🎉

Comparison is base (b70935b) 92.14% compared to head (0b6d9a4) 92.44%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #64      +/-   ##
==========================================
+ Coverage   92.14%   92.44%   +0.29%     
==========================================
  Files          10       12       +2     
  Lines         891      913      +22     
==========================================
+ Hits          821      844      +23     
+ Misses         70       69       -1

| Impacted Files | Coverage Δ | |
|---|---|---|
| sktree/ensemble/_unsupervised_forest.py | `70.40% <90.90%> (-1.66%)` | ⬇️ |
| sktree/ensemble/_supervised_forest.py | `100.00% <100.00%> (ø)` | |
| sktree/tests/test_neighbors.py | `100.00% <100.00%> (ø)` | |
| sktree/tests/test_unsupervised_forest.py | `100.00% <100.00%> (ø)` | |
| sktree/tree/_classes.py | `88.03% <100.00%> (-0.45%)` | ⬇️ |
| sktree/tree/_neighbors.py | `100.00% <100.00%> (ø)` | |
| sktree/tree/tests/test_tree.py | `100.00% <100.00%> (ø)` | |

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

sampan501 · 2023-04-14T13:26:58Z

@adam2392 Thoughts?

Also, I believe the style checks are failing for files I did not edit

adam2392

Per our discussion, maybe we should move this functionality to an ensemble/neighbors.py file so it can be used with supervised trees as well?

Any chance you have a short, good unit test that works for this function? I realized we don't have one right now.
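As an illustration of what such a test could check, here is a hedged sketch. `similarity_from_leaves` is a hypothetical stand-in for the function under test, not the function added in this PR, and the leaf indices are random integers rather than the output of a fitted forest. The properties asserted (symmetry, unit diagonal, values in [0, 1]) hold for any similarity defined as the fraction of trees in which two samples share a leaf.

```python
import numpy as np


def similarity_from_leaves(leaves):
    """Hypothetical function under test: fraction of trees where two samples share a leaf."""
    n_trees = leaves.shape[1]
    counts = sum(np.equal.outer(leaves[:, t], leaves[:, t]) for t in range(n_trees))
    return counts / n_trees


def test_similarity_matrix_properties():
    rng = np.random.default_rng(0)
    # Made-up leaf indices: 20 samples, 10 trees, 5 possible leaves per tree.
    leaves = rng.integers(0, 5, size=(20, 10))
    sim = similarity_from_leaves(leaves)

    assert sim.shape == (20, 20)
    assert np.allclose(sim, sim.T)           # symmetric
    assert np.allclose(np.diag(sim), 1.0)    # each sample matches itself in every tree
    assert sim.min() >= 0.0 and sim.max() <= 1.0
```

In the real test suite the leaf indices would come from a fitted estimator's `apply(X)` instead of a random generator.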

adam2392 · 2023-04-14T13:28:46Z

> @adam2392 Thoughts?
>
> Also, I believe the style checks are failing for files I did not edit

Can you just run black on the whole repo?

sampan501 · 2023-04-14T13:30:29Z

Yep I can do that

sampan501 · 2023-04-14T15:32:43Z

@adam2392 For the supervised versions, the options are to edit ForestClassifier in scikit-learn-fork, or just add the similarity matrix computation to each respective forest type (and do the same when I write the dissimilarity matrix functions). Which do you prefer?

adam2392 · 2023-04-14T15:52:12Z

> @adam2392 For the supervised versions, the options are to edit ForestClassifier in scikit-learn-fork, or just add the similarity matrix computation to each respective forest type (and do the same when I write the dissimilarity matrix functions). Which do you prefer?

So, based on our email, we want to add this functionality to all classes, but also we might want to keep the fork lightweight unless necessary.

I would just add a Mixin class that has this functionality and add the Mixin class to all the trees/forests applicable in scikit-tree.

We should/could also expose a functional interface, so just

def compute_similarity_matrix_forest(...)


def compute_dissimilarity_matrix_forest(...)

These are then called in the tree/forest class methods. I think the functional approach makes more sense because it leans towards not adding to the tree classes' API.
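A rough sketch of how the functional interface and a thin mixin could fit together is below. The function name follows the suggestion above and the method name `compute_similarity_matrix` follows its later use in this thread; `SimilarityMixin` and `StubForest` are hypothetical names invented for the example, and the stub only mimics a fitted forest by exposing `apply`.

```python
import numpy as np


def compute_similarity_matrix_forest(forest, X):
    """Works with any object whose apply(X) returns (n_samples, n_trees) leaf indices."""
    leaves = np.asarray(forest.apply(X))
    n_trees = leaves.shape[1]
    counts = sum(np.equal.outer(leaves[:, t], leaves[:, t]) for t in range(n_trees))
    return counts / n_trees


class SimilarityMixin:
    """Hypothetical mixin: the method only delegates to the function."""

    def compute_similarity_matrix(self, X):
        return compute_similarity_matrix_forest(self, X)


class StubForest(SimilarityMixin):
    """Hypothetical stand-in for a fitted forest: each 'tree' thresholds feature 0."""

    def __init__(self, thresholds):
        self.thresholds = np.asarray(thresholds)

    def apply(self, X):
        # Leaf id 0 or 1 per tree, depending on the side of the threshold.
        return (np.asarray(X)[:, [0]] > self.thresholds[None, :]).astype(int)


forest = StubForest(thresholds=[0.5, 1.5])
X = np.array([[0.0], [1.0], [2.0]])
sim = forest.compute_similarity_matrix(X)
```

Keeping the logic in the function means the class API grows by at most one delegating method, and the function can still be called directly on estimators that do not inherit the mixin.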

adam2392

Do you mind adding the dissimilarity matrix while you're at it and the reference that Jovo mentioned in the email thread?

https://arxiv.org/pdf/1812.00029.pdf

Otherwise, LGTM

adam2392 · 2023-04-14T19:15:43Z

To fix the CI issues:

- Run `poetry run poe format`, which applies all formatting checks. You can run `poetry run poe lint` to check for any flake8 and pydocstyle issues.
- Add sktree/ensemble/_neighbors.py to the corresponding meson.build file inside ensemble/ to register the additional Python file.

adam2392 · 2023-05-05T21:15:04Z

I fixed the numpydoc errors, sphinx errors and also refactored this.

I think storing a dissimilarity matrix is not needed and just adds RAM cost for no gain. Anyone can take 1 - forest.similarity_matrix_ to get the same result.

I also factored out the similarity computation as a function, so it can be easily used with any BaseForest subclass that implements apply.

adam2392

All CIs fixed, and I applied some fixes. LGTM

sampan501 · 2023-05-06T22:29:22Z

Looks good to me!

adam2392 · 2023-05-08T15:51:03Z

@sampan501 I removed the automatic computation of the similarity matrix, as it caused unnecessary RAM/CPU usage during the docs build. I think we can enable it in the future if necessary, but for now one can just call compute_similarity_matrix(X).

adam2392 · 2023-06-16T03:37:28Z

@sampan501 I noticed we never added dissimilarity_matrix_. What is the way to compute this that's in line with the affinity matrix?

Moreover, are we missing a division by the affinity_matrix.max() somewhere?

sampan501 · 2023-06-16T11:01:15Z

I think we talked about not storing both in memory since dissim is just 1 - sim.

Also, since the max is 1, there is no need to divide by the max.
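A small worked example of both points, assuming the similarity is normalized by the number of trees (which is what a maximum of 1 implies); the leaf indices are made up for illustration.

```python
import numpy as np

# Hypothetical leaf indices for 3 samples across 4 trees.
leaves = np.array([[0, 1, 1, 2], [0, 1, 3, 2], [5, 4, 3, 2]])
n_trees = leaves.shape[1]

# Fraction of trees in which each pair of samples lands in the same leaf.
sim = sum(np.equal.outer(leaves[:, t], leaves[:, t]) for t in range(n_trees)) / n_trees

# Every sample shares a leaf with itself in every tree, so the diagonal is 1,
# and no entry can exceed that.
assert np.allclose(np.diag(sim), 1.0)
assert sim.max() == 1.0

# Dividing by the max is therefore a no-op.
assert np.array_equal(sim / sim.max(), sim)

# The dissimilarity can be derived on demand instead of being stored.
dissim = 1.0 - sim
```

Here samples 0 and 1 share a leaf in three of the four trees, so their similarity is 0.75 and their dissimilarity is 0.25.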

replace nested for loop with np.equal.outer

54a52c1

sampan501 requested a review from adam2392 April 14, 2023 13:11

fix black formatting

2373b26

adam2392 reviewed Apr 14, 2023

sampan501 changed the title ~~[ENH] faster affinity matrix computation~~ [ENH] faster affinity and dissimilarity matrix computation Apr 14, 2023

refactor unsupervised random forest similarity matrix calculation

65c44ca

adam2392 reviewed Apr 14, 2023

Merge branch 'main' into better-affinity-matrix

b7d3bc1

adam2392 mentioned this pull request Apr 18, 2023

ENH: port honest forests from neurodata/honest-forests #57

Merged

5 tasks

sampan501 and others added 2 commits May 4, 2023 11:10

Merge branch 'main' into better-affinity-matrix

e42fd73

add neighbors.py to meson

9af5db3

adam2392 reviewed May 4, 2023

sktree/ensemble/_unsupervised_forest.py (outdated)

sampan501 added 4 commits May 4, 2023 12:27

refactor sim_matrix to mixin class

95f68d2

add sim/disim matrix as class attributes

0921aea

update changelog

15aa389

remove affinity matrix reference

87e01b3

adam2392 reviewed May 4, 2023

sktree/ensemble/_neighbors.py (outdated)

adam2392 reviewed May 4, 2023

sktree/ensemble/_supervised_forest.py (outdated)

adam2392 reviewed May 4, 2023

sktree/ensemble/_supervised_forest.py (outdated)

sampan501 and others added 4 commits May 4, 2023 14:40

Update sktree/ensemble/_supervised_forest.py

4c50133

Co-authored-by: Adam Li <adam2392@gmail.com>

Update sktree/ensemble/_supervised_forest.py

005bef7

Co-authored-by: Adam Li <adam2392@gmail.com>

Try gh actions

fbf8bca

Signed-off-by: Adam Li <adam2392@gmail.com>

Fix

f4f977e

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392 added 9 commits May 5, 2023 17:31

Fix sphinx

f7cb583

Signed-off-by: Adam Li <adam2392@gmail.com>

Try running again

d11b894

Signed-off-by: Adam Li <adam2392@gmail.com>

Fix main yml

7b38e11

Signed-off-by: Adam Li <adam2392@gmail.com>

Fix unsupervised

62f3349

Signed-off-by: Adam Li <adam2392@gmail.com>

Try fixing this

eb1f9f4

Signed-off-by: Adam Li <adam2392@gmail.com>

Try again

1301f1d

Signed-off-by: Adam Li <adam2392@gmail.com>

Try again

2ae9d83

Signed-off-by: Adam Li <adam2392@gmail.com>

Try again

22e29cb

Signed-off-by: Adam Li <adam2392@gmail.com>

Add coverage for trees

ecc61f9

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392 approved these changes May 6, 2023

adam2392 added 11 commits May 6, 2023 18:32

Fix where neighbors starts

a10d5b8

Signed-off-by: Adam Li <adam2392@gmail.com>

Clean meson build

cfbacea

Signed-off-by: Adam Li <adam2392@gmail.com>

Try again

f73e804

Signed-off-by: Adam Li <adam2392@gmail.com>

Fix noplot

69e7d1e

Signed-off-by: Adam Li <adam2392@gmail.com>

Try again

da48bb1

Signed-off-by: Adam Li <adam2392@gmail.com>

Fixed

4b066fe

Signed-off-by: Adam Li <adam2392@gmail.com>

Fix

459575d

Signed-off-by: Adam Li <adam2392@gmail.com>

Try again

d35a678

Signed-off-by: Adam Li <adam2392@gmail.com>

Try again

d30b5dd

Signed-off-by: Adam Li <adam2392@gmail.com>

Factor out exp check

0b6d9a4

Signed-off-by: Adam Li <adam2392@gmail.com>

Work?

145d459

Signed-off-by: Adam Li <adam2392@gmail.com>

adam2392 merged commit aa7e174 into main May 8, 2023
19 checks passed

adam2392 deleted the better-affinity-matrix branch May 8, 2023 15:41

adam2392 mentioned this pull request May 9, 2023

Change build requirements to use pip's scikit-learn-tree #74

Closed
