PUBDEV-7138 Extended Isolation Forest #4319

Closed

@valenad1 valenad1 commented Feb 14, 2020

The production version of the algorithm will be finished here: #5246

Implementation of the Extended Isolation Forest algorithm.

https://0xdata.atlassian.net/browse/PUBDEV-7138

All review feedback is implemented there or noted for future implementation.

The given column vector is subtracted from all columns of the matrix.
…ian distribution. Possibility to make last values in Vec zeros. Used in extended isolation forest algorithm.
…es from Uniform distribution. Min and max is taken from min and max of given frame. First version in 14caaef
The previous implementation requires a matrix of dimension (m x n) and a vector of dimension (m x 1) instead of a vector of dimension (n x 1) or (1 x n).
The previous implementation of matrix multiplication requires matrix dimension (m x n) and vector dimension (m x 1).
I created a new method for a vector (n x 1) and left the old implementation untouched because it is already used.
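
The commit notes above mention Gaussian-distributed values with optional trailing zeros, uniform values taken between the frame's min and max, and subtracting a vector from a matrix. In the usual formulation of Extended Isolation Forest these pieces combine into a random hyperplane split. The following is a minimal sketch under that assumption, not the code from this PR; the names `random_split` and `extension_level` are hypothetical.

```python
import random


def random_split(rows, extension_level, rng):
    """Split rows by one random hyperplane (illustrative only)."""
    n_cols = len(rows[0])

    # Normal vector: Gaussian components. Trailing components are zeroed so
    # that only extension_level + 1 coordinates influence the split.
    normal = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    for j in range(extension_level + 1, n_cols):
        normal[j] = 0.0

    # Intercept point: uniform between the per-column min and max of the data.
    intercept = [
        rng.uniform(min(r[j] for r in rows), max(r[j] for r in rows))
        for j in range(n_cols)
    ]

    # Subtract the intercept from each row, project onto the normal vector,
    # and send the row left or right depending on the sign.
    left, right = [], []
    for r in rows:
        projection = sum((r[j] - intercept[j]) * normal[j] for j in range(n_cols))
        (left if projection <= 0 else right).append(r)
    return normal, intercept, left, right


rows = [[0.0, 5.0], [1.0, 4.0], [2.0, 3.0], [3.0, 2.0]]
normal, intercept, left, right = random_split(rows, 0, random.Random(1))
print(normal, intercept, len(left), len(right))
```

With `extension_level = 0` only the first coordinate of the normal vector is non-zero, so the split is axis-parallel; with `extension_level = n_cols - 1` no coordinate is zeroed.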
@valenad1 valenad1 force-pushed the PUBDEV-7138-extended-isolation-forest branch from 269d5a6 to 2290b92 Compare February 17, 2020 02:24
@valenad1 valenad1 force-pushed the PUBDEV-7138-extended-isolation-forest branch 3 times, most recently from 4c7bf7d to 31549f4 Compare February 18, 2020 19:54
@michalkurka michalkurka self-requested a review February 18, 2020 21:19
@maurever maurever left a comment

Looks good! Minor changes are requested.

@maurever maurever self-requested a review September 8, 2020 09:18
valenad1 commented Oct 5, 2020

Link: https://github.com/valenad1/h2o-3/blob/PUBDEV-7138-extended-isolation-forest-leak-fix/h2o-py/demos/extisofor/extended-isolation-forest-benchmark.ipynb

Extended Isolation Forest Benchmark - Training stage

In the tests, the H2O cloud is initialized before each run and shut down after each run. The algorithm is first tested on training performance and compared to the Isolation Forest run time.

  • IF = Isolation Forest
  • EIF = Extended Isolation Forest
  • N = number of rows
  • P = number of columns
  • ntrees = number of trees to train (always 100)
  • sample_size = how many rows will be used to build a tree
  • max_depth = depth of the tree; configurable only for IF. In EIF it is always set to math.ceil(math.log(sample_size, 2)), and in the benchmark max_depth always depends on sample_size.
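
As a quick check of the depth formula in the list above, this small sketch evaluates it for the sample_size values that appear in the benchmark. The helper name `eif_max_depth` is only illustrative and is not part of the H2O API.

```python
import math


def eif_max_depth(sample_size):
    # Depth limit exactly as written in the parameter list.
    return math.ceil(math.log(sample_size, 2))


for sample_size in (256, 2_848, 10_000, 15_000):
    print(sample_size, eif_max_depth(sample_size))
```

This gives a depth of 8 for sample_size = 256, 12 for 2 848 rows, and 14 for both 10 000 and 15 000 rows, so the depth grows slowly even when sample_size grows by orders of magnitude.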

Computer parameters:

  • Lenovo ThinkPad P53,
  • MS Windows 10 Pro x64,
  • Intel Core i7-9850H CPU @ 2.60GHz,
  • 6 cores and 12 threads,
  • 96.0 GB RAM.

Training process:

eif-parallel

The algorithm's runtime performance depends on the ntrees and sample_size parameters. The ntrees parameter is fixed at 100 for all runs. The sample_size setting and the amount of data vary between runs. I focused on values that switch the MRTask ON/OFF.

Results

1.1 Toy data

1.1.1 Small data and small dimension

h2o-scale-perf_256_500_2_0 5679054737091065

1.1.2 Small data and high dimension

h2o-scale-perf_256_5000_30_1 6009737014770509

1.1.3 Big data - small dimension, small sample_size

h2o-scale-perf_256_1500000_37 48055565357208

1.1.4 Big data - high dimension, small sample_size

h2o-scale-perf_256_100000_11 623167252540588

1.1.5 Big data - small dimension, big sample_size

h2o-scale-perf_15000_1500000_57 57910377979279

1.1.6 Big data - high dimension, big sample_size

  • N = 100_000
  • P = 30
  • sample_size = 10_000

Finished with an error (more information in the conclusion)

H2OConnectionError: Unexpected HTTP error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

1.2 Real Credit Card Fraud Detection Data

1.2.1 Test with sample_size = 256 (default value)

h2o-scale-perf_256_284807_30_26 633357524871826

1.2.2 Real data - sample_size = 1% (2 848 rows) of data

Finished with an error (more information in the conclusion)

H2OConnectionError: Unexpected HTTP error: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

Conclusion

  • The biggest problem I see across the various runs is the sensitivity to the P parameter, no matter what the sample_size setting is. EIF is significantly slower than IF even for small data with 500 rows (section 1.1.2). If N grows and sample_size is small, EIF finishes more slowly (sections 1.1.4, 1.2.1). The real problem comes with large data and a big sample_size, where EIF is not able to finish, most probably because of memory requirements (sections 1.1.6, 1.2.2, 1.2.3). When I test it in Java, all trees start being prepared at once and keep their Nodes and temporary Frames, which leads to the following warning message. With the same settings, IF finishes without any problem and with good performance.

WARN water.default: Unblock allocations; cache below desired, but also OOM: OOM, (K/V:139,7 MB + POJO:19,27 GB + FREE:1,87 GB == MEM_MAX:21,27 GB), desiredKV=2,66 GB OOM!

  • If the P parameter is small and sample_size is small, then EIF performs equally to or better than IF.
  • If the P parameter is small and sample_size is big, then EIF performs equally to IF.

Solution attempt:

I have tried to solve it with a better distribution of IsolationTree building. I hardcoded building 144 trees (12 on each thread), but with no difference in run time. Even if it worked, I don't think it would be usable for a cluster, where the performance of each node can and most probably will vary. That is why I only submit tasks to the H2O cluster (one task per tree). Trees operate on their own sub-sample of the data and can be built totally independently. But building all trees at once seems to be memory- and time-consuming.
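
Since each tree only needs its own sub-sample, one possible way to limit memory is to keep submitting one task per tree but cap how many run at the same time. The sketch below illustrates that idea with Python's standard thread pool. It is not how the PR or H2O schedules work, and `build_tree`, `build_forest`, and `max_in_flight` are hypothetical names.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def build_tree(tree_id, data, sample_size, seed):
    """Hypothetical stand-in for building one isolation tree."""
    rng = random.Random(seed + tree_id)
    # Each tree draws its own sub-sample, independently of the others.
    sample = rng.sample(data, min(sample_size, len(data)))
    # A real builder would grow a tree here; only a small summary is kept,
    # so the sub-sample can be freed as soon as the task finishes.
    return {"tree_id": tree_id, "n_rows": len(sample)}


def build_forest(data, ntrees, sample_size, max_in_flight, seed=0):
    # One task per tree; max_workers caps how many tasks (and therefore
    # how many sub-samples) are alive at the same time.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        futures = [
            pool.submit(build_tree, t, data, sample_size, seed)
            for t in range(ntrees)
        ]
        return [f.result() for f in futures]


forest = build_forest(list(range(5_000)), ntrees=100, sample_size=256, max_in_flight=12)
print(len(forest))
```

Queued tasks wait until a worker is free, so a slower worker simply picks up fewer trees; no fixed number of trees per thread is assumed.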

maurever commented Oct 5, 2020

I am sorry @honzasterba and @michalkurka, I sent you a new request for review, but I would like to know just your opinion on the benchmark right now.

But thank you @honzasterba for your new review. We agreed with your previous comments about removing MOJO and extension from SharedTreeModel. @valenad1 just has not implemented it yet, but it is on his todo list.

@valenad1 valenad1 changed the title PUBDEV-7138 Extended Isolation Forest PUBDEV-7138 - Extended Isolation Forest Jan 19, 2021
@valenad1 valenad1 changed the title PUBDEV-7138 - Extended Isolation Forest PUBDEV-7138 Extended Isolation Forest Jan 19, 2021
@valenad1 valenad1 changed the base branch from master to rel-zermelo January 20, 2021 18:04
@valenad1 valenad1 changed the base branch from rel-zermelo to master January 20, 2021 18:05
@valenad1

The production version of the algorithm will be finished here: #5246
