stray {Search and TRace AnomalY}. Full paper is available from https://arxiv.org/pdf/1908.04000.pdf 🐶🐶🐶🐶🐱 🐶 🐶🐶🐶🐶

stray {Search and TRace AnomalY}

Project Status: WIP. Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

Anomaly Detection in High Dimensional Data Space

This package is a modification of the HDoutliers package. The HDoutliers algorithm is a powerful unsupervised algorithm for detecting anomalies in high-dimensional data, with a strong theoretical foundation. However, it has some limitations that significantly hinder its performance under certain circumstances. In this package, we propose an algorithm that addresses these limitations. We define an anomaly as an observation that deviates markedly from the majority of the data and is separated from it by a large distance gap. The anomaly threshold is calculated using an approach based on extreme value theory.
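
The "large distance gap" idea can be illustrated with a minimal base R sketch. This is not the stray algorithm: it skips the extreme value theory threshold and simply cuts at the single largest gap in the sorted nearest-neighbour distances, so it always flags at least one observation. The function name `gap_outliers` is hypothetical and is not part of the package.

```r
# Simplified illustration only; NOT the stray algorithm.
# Flags the observations whose nearest-neighbour distance lies above
# the largest gap in the sorted nearest-neighbour distances.
gap_outliers <- function(x) {
  x <- as.matrix(x)              # a vector becomes a one-column matrix
  d <- as.matrix(dist(x))        # pairwise Euclidean distances
  diag(d) <- Inf                 # ignore the distance of a point to itself
  nn <- apply(d, 1, min)         # nearest-neighbour distance per observation
  ord <- order(nn)               # observations sorted by that distance
  gaps <- diff(nn[ord])          # gaps between consecutive sorted distances
  cut <- which.max(gaps)         # position of the largest gap
  sort(ord[seq_along(ord) > cut])  # indices of observations above the gap
}

# Two evenly spaced clusters with a single isolated point at 0
x <- c(seq(-7, -5, by = 0.01), 0, seq(5, 7, by = 0.01))
gap_outliers(x)
#> [1] 202
```

The isolated point is the 202nd element of `x`, and it is the only one whose nearest neighbour is far away. A threshold based on extreme value theory, as used in this package, avoids the main weakness of this sketch, which is that a "largest gap" always exists even in data without anomalies.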

A companion paper to this work is available at https://arxiv.org/pdf/1908.04000.pdf. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our algorithm, which we call the stray algorithm. We also demonstrate how this algorithm can assist in detecting anomalies in other data structures using feature engineering. We also show situations where the stray algorithm outperforms the HDoutliers algorithm in both accuracy and computational time.
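
As a rough sketch of the feature engineering idea, a collection of time series can be reduced to one row of summary features per series, and the resulting matrix can then be handed to an outlier detector. The features below (mean, standard deviation, lag-1 autocorrelation, largest jump) and the helper name `series_features` are assumptions chosen for illustration; the feature set used in the paper may differ.

```r
# Illustrative features only; not taken from the paper or the package.
series_features <- function(series_list) {
  feats <- vapply(series_list, function(y) {
    c(mean = mean(y),
      sd = sd(y),
      lag1_acf = acf(y, lag.max = 1, plot = FALSE)$acf[2],
      max_jump = max(abs(diff(y))))
  }, numeric(4))
  t(feats)  # one row per series, one column per feature
}

set.seed(1)
series <- replicate(20, rnorm(100), simplify = FALSE)   # 20 ordinary series
series[[21]] <- rnorm(100) + c(rep(0, 50), rep(8, 50))  # one series with a level shift
features <- series_features(series)
dim(features)
#> [1] 21  4

# With stray installed, the feature matrix could then be searched for outliers:
# outliers <- stray::find_HDoutliers(features, knnsearchtype = "brute")
```

Each series becomes a point in a four-dimensional feature space, so an unusual series shows up as an unusual point.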

This package is still under development and this repository contains a development version of the R package stray.

Installation

You can install stray from GitHub with:

# install.packages("devtools")
devtools::install_github("pridiltal/stray")

Example

One dimensional data set with one outlier

library(stray)
library(ggplot2)
set.seed(1234)
data <- c(rnorm(1000, mean = -6), 0, rnorm(1000, mean = 6))
outliers <- find_HDoutliers(data, knnsearchtype = "brute")
names(outliers)
#> [1] "outliers"   "out_scores" "type"
display_HDoutliers(data, outliers)

plot of chunk onedim

Two dimensional data set with 10 outliers

set.seed(1234)
n <- 1000 # number of observations
nout <- 10 # number of outliers
# as.tibble() is deprecated; as_tibble() needs column names, so set them on the matrix
typical_data <- tibble::as_tibble(matrix(rnorm(2 * n), ncol = 2, byrow = TRUE,
                                         dimnames = list(NULL, c("V1", "V2"))))
out <- tibble::as_tibble(matrix(5 * runif(2 * nout, min = -5, max = 5), ncol = 2, byrow = TRUE,
                                dimnames = list(NULL, c("V1", "V2"))))
# the injected outliers come first, so they occupy rows 1 to nout
data <- dplyr::bind_rows(out, typical_data)
outliers <- find_HDoutliers(data, knnsearchtype = "brute")
display_HDoutliers(data, outliers)

plot of chunk twodim
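
Because the injected points are bound first, they sit in the first `nout` rows of `data`, which makes it easy to check how many of them were flagged. The helper below is a hypothetical sketch in base R and is not part of stray. It assumes that the `outliers` element of the result holds row indices, which the README does not state explicitly. The index vector in the usage lines is made up for illustration and is not real output.

```r
# Hypothetical helper (not part of stray): compare flagged rows with injected rows.
recovery_summary <- function(flagged, injected) {
  list(
    recovered = intersect(injected, flagged),  # injected rows that were flagged
    missed = setdiff(injected, flagged),       # injected rows that were not flagged
    extra = setdiff(flagged, injected)         # flagged rows that were not injected
  )
}

# Made-up index vector standing in for outliers$outliers
flagged_example <- c(1, 2, 3, 5, 8, 412)
res <- recovery_summary(flagged_example, injected = 1:10)
str(res)

# With the objects from the example above, the call would look like:
# recovery_summary(outliers$outliers, injected = seq_len(nout))
```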

More examples are available from here. The two examples below use the first two columns of data_c and data_d.

outliers <- find_HDoutliers(data_c[, 1:2], knnsearchtype = "brute")
p <- display_HDoutliers(data_c[, 1:2], outliers) +
  ggplot2::ggtitle("data_c") +
  ggplot2::theme(aspect.ratio = 1)

print(p)

plot of chunk dataa

outliers <- find_HDoutliers(data_d[, 1:2], knnsearchtype = "brute")
p <- display_HDoutliers(data_d[, 1:2], outliers) +
  ggplot2::ggtitle("data_d") +
  ggplot2::theme(aspect.ratio = 1)

print(p)

plot of chunk datad
