This project Implement the article of :
Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation forest."Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. IEEE, 2008.
IForest On Spark use spark to sampling data, and separate each partitoin to a spark worker. Each partition train n isolate trees. The train process is runing on paralle mode.
The prediction uses all isolation trees trained by spark, to predict the outlier factors.
SKLearn Iforest: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
Comparation: SKLearn Iforest: http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html
SVM OneClass: SVM OneClass Result:
IForest On Spark:
Project rely on spark-2.1.0-bin-hadoop2.7.Download at :http://spark.apache.org/downloads.html
How To Use:
var prop = new IForestProperty
prop.max_sample = 5000
prop.n_estimators = 1500
prop.max_depth_limit = (math.log(prop.max_sample) / math.log(2)).toInt
prop.bootstrap = true
prop.partition = 10
var ift = new IForestOnSpark(prop)
var data_mtx:DenseMatrix[Double] = ... (train data in matrix)
ift.fit(data_mtx, spark)
x: DenseVector[Double] = ... (test data)
var output = ift.predict(x)
//Serialize model to HDFS
var if_seralizer = new IForestSerializer
if_seralizer.serialize("hdfs://127.0.0.1/ifserialized", ift)
//Load model from HDFS
var if_loader = new IForestSerializer
var localmodel = if_loader.deserialize("hdfs://172.16.22.14:9000/ifserialized")