PUBDEV-7138 Extended Isolation Forest #4319
…Min and max is taken from min and max of given frame.
…in extended isolation forest algorithm.
…Uniform distribution. Min and max is taken from min and max of given frame.
From all columns of matrix the given column vector is subtracted.
…ian distribution. Possibility to make last values in Vec zeros. Used in extended isolation forest algorithm.
…es from Uniform distribution. Min and max is taken from min and max of given frame. First version in 14caaef
The previous implementation requires matrix dimension (m x n) and vector dimension (m x 1) instead of vector dimension (n x 1) or (1 x n).
The previous implementation of matrix multiplication requires matrix dimension (m x n) and vector dimension (m x 1). I created a new method for a vector of dimension (n x 1) and left the old implementation untouched because it is already used.
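For readers following the dimension discussion in the commit messages above, here is a minimal sketch of a product between an (m x n) matrix and an (n x 1) vector. The class and method names (MatVecSketch, multiply) are hypothetical and chosen only for illustration; this is not the method added in this PR, and it assumes plain double arrays rather than H2O frames.

```java
// Hypothetical helper for illustration only; not the H2O implementation.
final class MatVecSketch {

    /**
     * Multiplies an (m x n) matrix by an (n x 1) vector.
     * Returns an (m x 1) result: result[i] is the dot product of row i and the vector.
     */
    static double[] multiply(double[][] matrix, double[] vector) {
        int m = matrix.length;
        double[] result = new double[m];
        for (int i = 0; i < m; i++) {
            // Every row must have exactly n entries, where n is the vector length.
            if (matrix[i].length != vector.length) {
                throw new IllegalArgumentException(
                        "Row " + i + " has " + matrix[i].length
                        + " columns, expected " + vector.length);
            }
            double sum = 0.0;
            for (int j = 0; j < vector.length; j++) {
                sum += matrix[i][j] * vector[j];
            }
            result[i] = sum;
        }
        return result;
    }
}
```

For example, multiplying the 2 x 3 matrix {{1, 2, 3}, {4, 5, 6}} by the vector {1, 0, -1} yields a vector of length 2, one entry per matrix row.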
…f arrays. The result will be in the longer arrays from those two.
…ation from double[][]
…ion Forest. Inspired by Isolation Forest documentation.
Looks good! Minor changes are requested.
…turning of dummy data
…olation Forest smoke tests for Python and R
…orest user documentation
…ed Isolation Forest documentation
…of Extended Isolation Forest
Extended Isolation Forest Benchmark - Training stage
In the tests, the H2O cloud is initialized before each run and shut down after every run. The algorithm is first tested on training performance and compared to the Isolation Forest run time.
Computer parameters:
Training process:
The algorithm's runtime performance depends on the ntrees and sample_size parameters. The ntrees parameter is fixed at 100 for all runs. The sample_size setting and the amount of data vary between runs. I focused on values that switch the MRTask ON/OFF.
Results
1.1 Toy data
1.1.1 Small data and small dimension
1.1.2 Small data and high dimension
1.1.3 Big data - small dimension, small sample_size
1.1.4 Big data - high dimension, small sample_size
1.1.5 Big data - small dimension, big sample_size
1.1.6 Big data - high dimension, big sample_size
Finished with an error (more information in the conclusion)
1.2 Real Credit Card Fraud Detection Data
1.2.1 Test with sample_size = 256 (default value)
1.2.2 Real data - sample_size = 1% (2 848 rows) of data
Finished with an error (more information in the conclusion)
Conclusion
Solution attempt: I tried to solve it by distributing the IsolationTree building better. I hardcoded building 144 trees (12 on each thread), but it made no difference in run time. Even if it worked, I don't think it would be usable on a cluster, where the performance of each node can and most probably will vary. That is why I only submit tasks to the H2O cluster (one task per tree). Each tree operates on its own sub-sample of the data and can be built completely independently, but building all trees at once seems to be memory- and time-consuming.
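The "one task per tree" idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses java.util.concurrent as a stand-in for H2O task submission, and TreePerTaskSketch, buildTree and buildForest are hypothetical names. The placeholder buildTree merely draws row indices (with replacement) to show that each task works on its own sub-sample; it does not build a real isolation tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustration only: a thread pool stands in for the H2O cluster.
final class TreePerTaskSketch {

    /** Placeholder for tree building: returns the row indices sampled for this tree. */
    static int[] buildTree(int treeId, int numRows, int sampleSize, long seed) {
        Random rnd = new Random(seed + treeId); // independent stream per tree
        int[] sample = new int[sampleSize];
        for (int i = 0; i < sampleSize; i++) {
            sample[i] = rnd.nextInt(numRows);
        }
        return sample;
    }

    /** Submits one independent task per tree and collects the results in tree order. */
    static List<int[]> buildForest(int ntrees, int numRows, int sampleSize,
                                   int threads, long seed) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<int[]>> futures = new ArrayList<>();
            for (int t = 0; t < ntrees; t++) {
                final int treeId = t;
                futures.add(pool.submit(() -> buildTree(treeId, numRows, sampleSize, seed)));
            }
            List<int[]> forest = new ArrayList<>();
            for (Future<int[]> f : futures) {
                forest.add(f.get()); // blocks until that tree is finished
            }
            return forest;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Tree building failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the pool hands out tasks as workers become free, faster workers simply pick up more trees, which is the property a fixed split such as 12 trees per thread does not have. Note that all submitted tasks and their results are held in memory at once in this sketch, which mirrors the memory concern raised above.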
I am sorry, @honzasterba and @michalkurka: I sent you a new request for review, but right now I would just like to know your opinion on the benchmark. Still, thank you, @honzasterba, for your new review. We agreed with your previous comments about removing MOJO and the extension from SharedTreeModel. @valenad1 just has not implemented it yet, but it is on his todo list.
The production version of the algorithm will be finished here: #5246
Implementation of Extended Isolation Forest algorithm.
https://0xdata.atlassian.net/browse/PUBDEV-7138
All review feedback is implemented there or noted for future implementation.