Isolation Forest Implementation

The goal of this project is to implement the original Isolation Forest algorithm by Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. (A later version of this work is also available: Isolation-based Anomaly Detection.) There are two general approaches to anomaly detection:

model what normal looks like and then look for nonnormal observations
focus on the anomalies, which are few and different. This is the interesting and relatively-new approach taken by the authors of isolation forests.

Data sets

For this project, we'll use three data sets:

Kaggle credit card fraud competition data set; download, unzip to get creditcard.csv
Get the cancer.csv dataset: cancer.csv
Get the http.csv dataset: http.csv

Visualization of normal versus anomaly separation

Using plot_anomalies.py, which I provide for you, you can see the results of the isolation forest trying to detect anomalies. These data sets all have known targets indicating normal versus anomaly, but this information is only used during testing and not during training. In other words, we use this information to discover how well we can separate the distribution of normal versus anomalous observations. The section provides a number of results, but yours might look different because of the inherent randomness involved in selecting subsets of the data and constructing random trees. (click on the images to enlarge.)

http.csv, 200 trees, 99% desired TPR

creditcard.csv, 200 trees, 80% desired TPR	creditcard.csv, 200 trees, 90% desired TPR

cancer, 300 trees, 70% desired TPR	cancer, 300 trees, 80% desired TPR

Algorithm

For your convenience, here are the algorithms extracted from the Liu et al paper:

Please use this version of average path length c(), not the one in the original paper:

Then finally here's the scoring formula:

where "H(i) is the harmonic number and it can be estimated by ln(i) + 0.5772156649 (Euler’s constant)."

Scoring results

Using score.py, here is a sample run:

Running noise=False improved=False
INFO creditcard.csv fit time 0.23s
INFO creditcard.csv 18804 total nodes in 200 trees
INFO creditcard.csv score time 14.54s
SUCCESS creditcard.csv 200 trees at desired TPR 80.0% getting FPR 0.0300%

INFO http.csv fit time 0.28s
INFO http.csv 22430 total nodes in 300 trees
INFO http.csv score time 23.08s
SUCCESS http.csv 300 trees at desired TPR 99.0% getting FPR 0.0053%

INFO cancer.csv fit time 0.08s
INFO cancer.csv 8204 total nodes in 1000 trees
INFO cancer.csv score time 0.73s
SUCCESS cancer.csv 1000 trees at desired TPR 75.0% getting FPR 0.2857%

Due to the subsampling of the original data said and the inherent random nature of isolation forest, the results will differ even from run to run.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
images		images
README.md		README.md
cancer.csv		cancer.csv
http.csv		http.csv
iforest.py		iforest.py
iforest_paper_icdm08b.pdf		iforest_paper_icdm08b.pdf
plot_anomalies.py		plot_anomalies.py
score.py		score.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

images

images

README.md

README.md

cancer.csv

cancer.csv

http.csv

http.csv

iforest.py

iforest.py

iforest_paper_icdm08b.pdf

iforest_paper_icdm08b.pdf

plot_anomalies.py

plot_anomalies.py

score.py

score.py

Repository files navigation

Isolation Forest Implementation

Data sets

Visualization of normal versus anomaly separation

Algorithm

Scoring results

About

Releases

Packages

Languages

joy-wj/isolation-forest

Folders and files

Latest commit

History

Repository files navigation

Isolation Forest Implementation

Data sets

Visualization of normal versus anomaly separation

Algorithm

Scoring results

About

Topics

Resources

Stars

Watchers

Forks

Languages