PUBDEV-7138 Extended Isolation Forest #4319

valenad1 · 2020-02-14T11:46:38Z

The production version of the algorithm will be finished here: #5246

Implementation of the Extended Isolation Forest algorithm.

https://0xdata.atlassian.net/browse/PUBDEV-7138

All review feedback is implemented there or noted for future implementation.

h2o-algos/src/main/java/hex/tree/isoforextended/ExtendedIsolationForest.java

h2o-algos/src/main/java/hex/tree/isoforextended/ExtendedIsolationForestParameters.java

maurever

Looks good! Minor changes are requested.

h2o-algos/src/main/java/hex/tree/isoforextended/ExtendedIsolationForest.java

h2o-algos/src/main/java/hex/tree/isoforextended/ExtendedIsolationForestModel.java

h2o-core/src/main/java/water/util/ArrayUtils.java

h2o-docs/src/product/data-science/eif.rst

...nmodel/src/main/java/hex/genmodel/algos/isoforextended/ExtendedIsolationForestMojoModel.java

h2o-py/docs/modeling.rst

h2o-py/h2o/estimators/extended_isolation_forest.py

h2o-py/tests/testdir_algos/isoforextended/pyunit_isoforextended_smoke.py

h2o-algos/src/main/java/hex/tree/isoforextended/ExtendedIsolationForest.java

h2o-algos/src/main/java/hex/tree/isoforextended/ExtendedIsolationForest.java

valenad1 · 2020-10-05T03:49:32Z

Link: https://github.com/valenad1/h2o-3/blob/PUBDEV-7138-extended-isolation-forest-leak-fix/h2o-py/demos/extisofor/extended-isolation-forest-benchmark.ipynb

Extended Isolation Forest Benchmark - Training stage

In the tests, the H2O cloud is initialized before each run and shut down after every run. The algorithm is first tested on training performance and compared to the Isolation Forest run time.

IF = Isolation Forest
EIF = Extended Isolation Forest
N = number of rows
P = number of columns
ntrees = number of trees to train (always 100)
sample_size = how many rows will be used to build a tree
max_depth = only for IF, the maximum depth of a tree; in EIF the depth is always set to math.ceil(math.log(sample_size, 2)), and in the benchmark max_depth always depends on sample_size.
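The depth rule above can be checked with a few lines of Python. This is a minimal sketch: the helper name eif_max_depth is hypothetical (it is not part of the H2O API), and the sample sizes are the ones that appear in this benchmark.

```python
import math

def eif_max_depth(sample_size: int) -> int:
    """Depth limit as stated in the benchmark: ceil(log2(sample_size))."""
    if sample_size < 1:
        raise ValueError("sample_size must be a positive integer")
    return math.ceil(math.log(sample_size, 2))

# Sample sizes used in the benchmark runs.
for sample_size in (256, 2_848, 10_000):
    print(f"sample_size={sample_size} -> max_depth={eif_max_depth(sample_size)}")
```

Because math.log(x, 2) is computed in floating point, math.log2 may be the more robust choice when sample_size is an exact power of two.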

Computer parameters:

Lenovo ThinkPad P53,
MS Windows 10 Pro x64,
Intel Core i7-9850H CPU @ 2.60GHz,
6 cores and 12 threads,
96.0 GB RAM.

Training process:

The algorithm's runtime depends on the ntrees and sample_size parameters. The ntrees parameter is fixed at 100 for all runs. The sample_size setting and the amount of data vary between runs. I focused on values that switch the MRTask ON and OFF.
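A timing loop of the kind described above might look like the following sketch. It is not the notebook's code: time_training and dummy_train are hypothetical names, and dummy_train is only a stand-in workload. In the real benchmark the callable would start the H2O cloud, train IF or EIF, and shut the cloud down again.

```python
import time
from statistics import mean
from typing import Callable, Dict, List

def time_training(train: Callable[[int, int], object],
                  ntrees: int, sample_size: int, repeats: int = 3) -> float:
    """Return the mean wall-clock seconds of `repeats` calls to train(ntrees, sample_size)."""
    durations: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        train(ntrees, sample_size)
        durations.append(time.perf_counter() - start)
    return mean(durations)

def dummy_train(ntrees: int, sample_size: int) -> int:
    # Stand-in workload (an assumption): cost grows with ntrees * sample_size.
    return sum(i % 7 for i in range(ntrees * sample_size))

NTREES = 100  # fixed for all runs, as in the benchmark
results: Dict[int, float] = {
    sample_size: time_training(dummy_train, NTREES, sample_size)
    for sample_size in (256, 2_848)
}
for sample_size, seconds in results.items():
    print(f"sample_size={sample_size}: {seconds:.4f} s")
```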

Results

1.1 Toy data

1.1.1 Small data and small dimension

1.1.2 Small data and high dimension

1.1.3 Big data - small dimension, small sample_size

1.1.4 Big data - high dimension, small sample_size

1.1.5 Big data - small dimension, big sample_size

1.1.6 Big data - high dimension, big sample_size

N = 100_000
P = 30
sample_size = 10_000

Finished with an error (more information in the conclusion)

H2OConnectionError: Unexpected HTTP error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

1.2 Real Credit Card Fraud Detection Data

1.2.1 Test with sample_size = 256 (default value)

1.2.2 Real data - sample_size = 1% (2 848 rows) of data

Finished with an error (more information in the conclusion)

H2OConnectionError: Unexpected HTTP error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Conclusion

The biggest problem I see across the various runs is the sensitivity to the P parameter, no matter what the sample_size setting is. EIF is significantly slower than IF even for small data with 500 rows (section 1.1.2). If N grows and sample_size is small, EIF finishes more slowly (sections 1.1.4, 1.2.1). The real problem comes with large data and a big sample_size, where EIF is not able to finish, most probably because of memory requirements (sections 1.1.6, 1.2.2, 1.2.3). When I test it in Java, all trees start being prepared at once and keep their Nodes and temporary Frames, which leads to the following warning message. With the same settings, IF finishes without any problem and with good performance.

WARN water.default: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:139,7 MB + POJO:19,27 GB + FREE:1,87 GB == MEM_MAX:21,27 GB), desiredKV=2,66 GB OOM!

If the P parameter is small and sample_size is small, EIF performs equally to or better than IF.
If the P parameter is small and sample_size is big, EIF performs equally to IF.

Solution attempt:

I have tried to solve it with a better distribution of IsolationTree building. I hardcoded building 144 trees (12 on each thread), but with no difference in run time. Even if it worked, I don't think it would be usable for a cluster where the performance of each node can and most probably will vary. That is why I am only submitting tasks to the H2O cluster (one task per tree). Trees operate on their own sub-sample of data and can be built totally independently. But building all trees at one time seems to be memory- and time-consuming.
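The trade-off described above (independent trees, but too many in flight at once) can be illustrated with a small Python sketch that caps how many trees are built concurrently. This is not how H2O schedules its tasks: build_tree, build_forest, and max_parallel are hypothetical names, and the tree itself is replaced by a placeholder.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from typing import List

def build_tree(rows: List[List[float]], sample_size: int, seed: int) -> dict:
    """Hypothetical stand-in for building one isolation tree on its own sub-sample."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(sample_size, len(rows)))
    # A real tree would be grown here; this sketch only records the sample size.
    return {"seed": seed, "n_sampled": len(sample)}

def build_forest(rows: List[List[float]], ntrees: int,
                 sample_size: int, max_parallel: int) -> List[dict]:
    """Build ntrees independent trees with at most max_parallel running at once."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = [pool.submit(build_tree, rows, sample_size, seed)
                   for seed in range(ntrees)]
        return [future.result() for future in futures]

rows = [[random.random() for _ in range(3)] for _ in range(1_000)]
forest = build_forest(rows, ntrees=100, sample_size=256, max_parallel=12)
print(len(forest), forest[0]["n_sampled"])
```

Because each sub-sample is created inside its task, only max_parallel sub-samples exist at any moment, which is the property that would bound temporary memory in this sketch.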

h2o-algos/src/main/java/hex/tree/isoforextended/ExtendedIsolationForest.java

h2o-algos/src/main/java/hex/tree/isoforextended/ExtendedIsolationForestMojoWriter.java

h2o-algos/src/main/java/hex/tree/isoforextended/SubSampleTask.java

h2o-core/src/main/java/water/util/MathUtils.java

h2o-algos/src/main/java/hex/tree/isoforextended/ExtendedIsolationForest.java

maurever · 2020-10-05T11:48:35Z

I am sorry @honzasterba and @michalkurka, I sent you a new request for review, but for now I would just like your opinion on the benchmark.

But thank you @honzasterba for your new review. We agreed with your previous comments about removing MOJO and the extension from SharedTreeModel. @valenad1 has just not implemented it yet, but it is on his todo list.

valenad1 · 2021-01-20T18:08:03Z

The production version of the algorithm will be finished here: #5246

valenad1 added 9 commits December 20, 2019 02:25

PUBDEV-7138 - Comparison of isolation forest implementations - single blob data

07fee07

PUBDEV-7138 - Comparison of isolation forest implementations - double blob data

a1e5fab

PUBDEV-7138 - Comparison of isolation forest implementations - sinusoidal data

d13df40

PUBDEV-7138 - make MatrixUtils class visible from all algos packages

b31be4e

PUBDEV-7138 - add method for vector multiplication into the MatrixUtils

b080435

PUBDEV-7138 - generate vector with values from Gaussian distribution

a3ee689

PUBDEV-7138 - generate vector with values from Uniform distribution. Min and max is taken from min and max of given frame.

2edd164

PUBDEV-7138 - add possibility to make last values in Vec zeros. Used in extended isolation forest algorithm.

42c5c0f

PUBDEV-7138 - Map reduce task for generation vector with values from Uniform distribution. Min and max is taken from min and max of given frame.

14caaef

valenad1 requested a review from maurever February 14, 2020 11:47

valenad1 added 6 commits February 15, 2020 02:39

PUBDEV-7138 - Map reduce task for matrix subtraction

d07af0f

The given column vector is subtracted from all columns of the matrix.

PUBDEV-7138 - MR task for generation of vector with values from Gaussian distribution. Possibility to make last values in Vec zeros. Used in extended isolation forest algorithm.

2029d25

PUBDEV-7138 - Enhance map reduce task for generation vector with values from Uniform distribution. Min and max is taken from min and max of given frame. First version in 14caaef

2f74fb3

PUBDEV-7138 - Fix - Map reduce task for matrix subtraction

73fbc07

The previous implementation required a matrix of dimension (m x n) and a vector of dimension (m x 1) instead of a vector of dimension (n x 1) or (1 x n).

PUBDEV-7138 - implement matrix and vector multiplication on MR

22dfcaf

The previous implementation of matrix multiplication requires matrix dimension (m x n) and vector dimension (m x 1). I created a new method for a vector of dimension (n x 1) and left the old implementation untouched because it is already used.
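The commits up to this point describe the building blocks of one Extended Isolation Forest split: a Gaussian vector with optionally zeroed trailing values, a uniform vector drawn between each column's min and max, a subtraction, and an (m x n) by (n x 1) product. The following pure-Python sketch shows how these pieces could combine. It is an illustration, not the PR's MRTask code: random_split is a hypothetical name, and keeping extension_level + 1 non-zero coordinates is an assumption that follows the convention of the Extended Isolation Forest paper.

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]

def random_split(rows: Matrix, extension_level: int,
                 rng: random.Random) -> Tuple[Matrix, Matrix]:
    """Split rows (m x n) by a random hyperplane with a normal vector and an intercept."""
    n_cols = len(rows[0])
    # Normal vector: one Gaussian value per column.
    normal = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    # Zero the last values so only extension_level + 1 coordinates stay active (assumption).
    for j in range(extension_level + 1, n_cols):
        normal[j] = 0.0
    # Intercept: one uniform value per column, between that column's min and max.
    intercept = [rng.uniform(min(r[j] for r in rows), max(r[j] for r in rows))
                 for j in range(n_cols)]
    left: Matrix = []
    right: Matrix = []
    for row in rows:
        # (row - intercept) . normal: the subtraction followed by the product, row by row.
        score = sum((x - p) * w for x, p, w in zip(row, intercept, normal))
        (left if score <= 0 else right).append(row)
    return left, right

demo_rng = random.Random(42)
demo_rows = [[demo_rng.random() for _ in range(3)] for _ in range(20)]
demo_left, demo_right = random_split(demo_rows, extension_level=2, rng=demo_rng)
print(len(demo_left), len(demo_right))
```

With extension_level=0 only the first coordinate of the normal vector is non-zero in this sketch, so the split reduces to an axis-parallel cut on that column.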

PUBDEV-7138 - create extended isolation forest class with required dependencies

32d708f

valenad1 force-pushed the PUBDEV-7138-extended-isolation-forest branch from 269d5a6 to 2290b92 Compare February 17, 2020 02:24

PUBDEV-7138 - Extended isolation forest tree split first not finished implementation

16b5888

valenad1 force-pushed the PUBDEV-7138-extended-isolation-forest branch 3 times, most recently from 4c7bf7d to 31549f4 Compare February 18, 2020 19:54

michalkurka self-requested a review February 18, 2020 21:19

maurever reviewed Feb 20, 2020

h2o-algos/src/main/java/hex/tree/isoforextended/ExtendedIsolationForest.java Outdated

h2o-algos/src/main/java/hex/tree/isoforextended/ExtendedIsolationForestParameters.java Outdated

valenad1 added 7 commits February 25, 2020 20:37

PUBDEV-7138 - Save product result into a vector in map method.

8ccc9f6

PUBDEV-7138 - refactoring

6411717

PUBDEV-7138 - matrix vector multiplication test for large dataset

16c165f

PUBDEV-7138 - add check to ArrayUtils.add method for different size of arrays. The result will be in the longer arrays from those two.

0b16804
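The behaviour this commit describes, adding two arrays of different lengths so that the result has the length of the longer one, can be sketched in Python as follows. The function name add_length_safe is hypothetical, and unlike the Java change described here, which stores the result in the longer array, this sketch returns a new list and leaves both inputs unchanged.

```python
from typing import List, Sequence

def add_length_safe(a: Sequence[float], b: Sequence[float]) -> List[float]:
    """Element-wise sum; the result has the length of the longer input."""
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    result = list(longer)  # copy, so neither input is modified
    for i, value in enumerate(shorter):
        result[i] += value
    return result

print(add_length_safe([1.0, 2.0, 3.0], [10.0, 20.0]))
```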

PUBDEV-7138 - remove forgotten sout

4f85a71

PUBDEV-7138 - second frame splitting implementation. Without frame creation from double[][]

25f4ef0

PUBDEV-7138 - let know about DMatrix.java class and change javadoc

60aff72

valenad1 added 2 commits August 28, 2020 01:44

PUBDEV-7138 - use Vector Reader for quicker access to the vector values

716da69

PUBDEV-7138 - first version of user documentation for Extended Isolation Forest. Inspired by Isolation Forest documentation.

a085b27

maurever requested changes Aug 28, 2020

valenad1 added 3 commits August 29, 2020 14:36

PUBDEV-7138 - use debug logs in Extended Isolation Forest scoring

283b95f

PUBDEV-7138 - leave current arrays add method unchanged

8a17811

PUBDEV-7138 - implement array length safe ADD operation and use it in MatrixUtils

f9b88f5

michalkurka reviewed Aug 31, 2020

h2o-algos/src/main/java/hex/tree/isoforextended/ExtendedIsolationForest.java Outdated

michalkurka reviewed Aug 31, 2020

h2o-algos/src/main/java/hex/tree/isoforextended/ExtendedIsolationForest.java

valenad1 added 6 commits August 31, 2020 20:11

KONREP-7138 - throw exception in not implemented method instead of returning of dummy data

ad943bc

PUBDEV-7138 - put comment and link to the source paper to Extended Isolation Forest smoke tests for Python and R

d038178

PUBDEV-7138 - add link to Blogs and tutorials in Extended Isolation Forest user documentation

4c5a94c

PUBDEV-7138 - create extension_level document page and improve Extended Isolation Forest documentation

c223ff4

PUBDEV-7138 - add example to extension_level in python documentation of Extended Isolation Forest

2885f86

PUBDEV-7138 - update job in the IsolationTreeForkJoin task

7fd5a54

maurever self-requested a review September 8, 2020 09:18

maurever approved these changes Sep 8, 2020

honzasterba reviewed Sep 9, 2020

h2o-algos/src/main/java/hex/tree/isoforextended/ExtendedIsolationForest.java

maurever requested review from maurever, michalkurka and honzasterba October 5, 2020 08:39

honzasterba reviewed Oct 5, 2020

valenad1 changed the title ~~PUBDEV-7138 Extended Isolation Forest~~ PUBDEV-7138 - Extended Isolation Forest Jan 19, 2021

valenad1 changed the title ~~PUBDEV-7138 - Extended Isolation Forest~~ PUBDEV-7138 Extended Isolation Forest Jan 19, 2021

valenad1 changed the base branch from master to rel-zermelo January 20, 2021 18:04

valenad1 changed the base branch from rel-zermelo to master January 20, 2021 18:05

valenad1 closed this Jan 20, 2021

h2o-ops mentioned this pull request May 14, 2023

Implement Extended Isolation Forest #8501

Closed
