homarques/DAO

Dimensionality-Aware Outlier Detection (DAO)

Repository of the paper:

Dimensionality-Aware Outlier Detection
Alastair Anderberg, James Bailey, Ricardo J. G. B. Campello,
Michael E. Houle, Henrique O. Marques, Miloš Radovanović, Arthur Zimek
SDM24

In this paper, we present a nonparametric method for outlier detection that takes full account of local variations in intrinsic dimensionality within the dataset. Using the theory of Local Intrinsic Dimensionality (LID), our 'dimensionality-aware' outlier detection method, DAO, is derived as an estimator of an asymptotic local expected density ratio involving the query point and a close neighbor drawn at random. The dimensionality-aware behavior of DAO is due to its use of local estimation of LID values in a theoretically-justified way.
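As a rough illustration of what "local estimation of LID values" can look like, the sketch below implements the widely used maximum-likelihood (Hill-type) LID estimator from a query point's k nearest-neighbor distances. It is an assumption that this matches the estimator used in the repository; the function name `lid_mle` is hypothetical, and the sketch does not reproduce the DAO score itself.

```python
import math


def lid_mle(neighbor_distances):
    """Maximum-likelihood (Hill-type) LID estimate for one query point.

    neighbor_distances: distances from the query to its k nearest neighbors,
    in any order. Zero distances (duplicate points) are ignored.
    Returns -1 / mean(ln(r_i / r_k)), where r_k is the largest distance.
    """
    dists = sorted(d for d in neighbor_distances if d > 0)
    if len(dists) < 2:
        raise ValueError("need at least two positive distances")
    r_max = dists[-1]
    mean_log_ratio = sum(math.log(r / r_max) for r in dists) / len(dists)
    if mean_log_ratio == 0:
        raise ValueError("all distances are equal; the estimate is undefined")
    return -1.0 / mean_log_ratio


# Illustrative input (not data from the paper): distances that grow like
# (i / k) ** (1 / d) mimic a uniform neighborhood of intrinsic dimension d.
k, d = 1000, 5
distances = [(i / k) ** (1 / d) for i in range(1, k + 1)]
print(lid_mle(distances))  # close to 5
```

The estimate depends only on the ratios between neighbor distances, so it is invariant to rescaling the data.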

Through comprehensive experimentation on more than 800 synthetic and real datasets, we show that DAO significantly outperforms three popular and important benchmark outlier detection methods: Local Outlier Factor (LOF), Simplified LOF, and kNN.
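For readers unfamiliar with the baselines, here is a minimal brute-force sketch of two of them, using their standard textbook definitions: the kNN score (distance to the k-th nearest neighbor) and Simplified LOF (the average ratio of the query's k-distance to the k-distances of its neighbors). The function names are hypothetical and this is not the repository's implementation. It assumes no duplicate points, since a zero k-distance would cause a division by zero.

```python
import math


def knn_info(points, k):
    """For each point, return (k-distance, indices of its k nearest neighbors)."""
    info = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        info.append((dists[k - 1][0], [j for _, j in dists[:k]]))
    return info


def knn_score(points, k):
    """kNN outlier score: the distance to the k-th nearest neighbor."""
    return [kdist for kdist, _ in knn_info(points, k)]


def simplified_lof(points, k):
    """Simplified LOF: mean of kdist(q) / kdist(o) over the k neighbors o of q."""
    info = knn_info(points, k)
    scores = []
    for kdist_q, neighbors in info:
        ratios = [kdist_q / info[j][0] for j in neighbors]
        scores.append(sum(ratios) / k)
    return scores


# Made-up toy data: four corners of a unit square plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(knn_score(points, k=2))
print(simplified_lof(points, k=2))
```

Both scores treat distance ratios the same way regardless of the local intrinsic dimension, which is the behavior the "dimensionality-unaware" label refers to.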

Detailed numbers for all experiments are given in tables in the Supplementary Material.


Repository setup

pip install -r requirements.txt

Downloading real datasets

Rscript R/downloadRealDatasets.r
Rscript R/preprocessing.r

Summary of real datasets

Rscript R/compileResults.r 'summaryRealDatasets'

Experimental Results

Evaluation of LID Estimation on DAO Performance

python run_synthetic.py
Rscript R/compileResults.r 'summaryResultsSyntheticDatasets'

Fig. 1. ROC AUC values for outlier detection performance over 480 synthetic datasets containing 2 clusters. One of the clusters (c1) has intrinsic dimension fixed at 8. The intrinsic dimension of the other cluster (c2) varies across the datasets (x-axis). The dashed vertical line indicates the reference set where both clusters lie on manifolds with the same intrinsic dimension (8). The results shown are averages over 30 datasets with the same characteristics. Bars indicate standard deviation.
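The quantities plotted in Fig. 1 can be sketched as follows: ROC AUC computed as the probability that a randomly chosen outlier receives a higher score than a randomly chosen inlier, then averaged over a group of datasets with a standard deviation. The function names are hypothetical. The use of the sample standard deviation is an assumption, since the caption does not say which variant is plotted.

```python
import statistics


def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison; labels are 1 (outlier) or 0 (inlier).

    Ties between an outlier score and an inlier score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one inlier")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(auc_values):
    """Mean and sample standard deviation over a group of datasets."""
    return statistics.mean(auc_values), statistics.stdev(auc_values)


# Made-up scores and labels, for illustration only.
print(roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))  # 0.75
```

The pairwise form is quadratic in the number of points; a rank-based computation gives the same value more efficiently on large datasets.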

Comparative Evaluation on Synthetic Datasets

Rscript R/compileResults.r 'lrSyntheticDatasets'

Comparative Evaluation on Real Datasets

python run_real.py
python stats.py

Simple linear regression

Rscript R/compileResults.r 'lrRealDatasets'

Visualizing Outlier Detection Performance

Rscript R/compileResults.r 'plot_R_MoransI'

Fig. 2. Differences in ROC AUC performance between DAO_MLE and the dimensionality-unaware methods over 393 real datasets. Blue dots indicate datasets where DAO outperforms its competitor, whereas red dots indicate the opposite. The 'Oracle' method indicates the best-performing competitor for each individual dataset. Color intensity is proportional to the ROC AUC difference. The x-axis shows Moran's I autocorrelation of the log-LID estimates, and the y-axis shows their dispersion R.

Critical Difference Diagram

Rscript R/compileResults.r 'plotCDRealDatasets'

Fig. 3. Critical difference diagram (significance level α = 1e-16) of average ranks of the methods on 393 real datasets: DAO_MLE vs. baseline competitors.
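The average ranks underlying a critical difference diagram can be sketched as below: on each dataset the methods are ranked by ROC AUC (rank 1 is best, ties share the average rank), and the ranks are then averaged across datasets. The function names and the example values are hypothetical, and the repository's R script may compute the ranks differently. The significance test that determines the critical difference is not shown.

```python
def rank_row(values):
    """Rank values so the highest gets rank 1; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1..end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def average_ranks(auc_by_method):
    """auc_by_method maps a method name to a list of ROC AUC values, one per dataset."""
    methods = list(auc_by_method)
    n_datasets = len(auc_by_method[methods[0]])
    totals = [0.0] * len(methods)
    for d in range(n_datasets):
        row = [auc_by_method[m][d] for m in methods]
        for i, r in enumerate(rank_row(row)):
            totals[i] += r
    return {m: totals[i] / n_datasets for i, m in enumerate(methods)}


# Made-up ROC AUC values on three datasets, for illustration only.
example = {
    "method_a": [0.90, 0.70, 0.85],
    "method_b": [0.80, 0.80, 0.85],
}
print(average_ranks(example))
```

A lower average rank means the method tends to place higher across datasets.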

Runtime Performance and Computational Complexity

python runtime.py
Rscript R/compileResults.r 'printRuntime'
