In [1]:
from __future__ import division, print_function
import numpy as np
from random_cut_forest import random_cut_forest as rcf  # Could also be import RandomCutForest
from sklearn import metrics

Suppose we have a multivariate normal data set X with a mean of zero and identity covariance matrix.

In [2]:
# generate a normally distributed dataset of dimension [n, p] called X
# this is normal non-anomalous data
n = 1000
p = 20
X = np.random.randn(n, p)
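
As a quick sanity check on the zero-mean, identity-covariance claim, the sample mean and sample covariance of such a draw can be compared with the zero vector and the identity matrix. This is only a minimal sketch: it regenerates its own data under a fixed seed, and the names `X_check`, `max_mean_dev`, and `max_cov_dev`, as well as the seed value, are illustrative and not part of the original notebook.

```python
import numpy as np

# Fixed seed so the check is repeatable; the seed value is an arbitrary choice.
rng = np.random.RandomState(0)
n, p = 1000, 20
X_check = rng.randn(n, p)

sample_mean = X_check.mean(axis=0)          # shape (p,), entries should be near 0
sample_cov = np.cov(X_check, rowvar=False)  # shape (p, p), should be near the identity

# Largest deviation from the theoretical mean and covariance.
max_mean_dev = np.abs(sample_mean).max()
max_cov_dev = np.abs(sample_cov - np.eye(p)).max()
print("largest |mean| entry:", max_mean_dev)
print("largest |cov - I| entry:", max_cov_dev)
```

With only 1000 rows the deviations are small but not zero, which is expected sampling noise.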

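Now create a small number of anomalies (about 5% of the rows).  These follow a different distribution: as generated below, each coordinate of an anomalous row is drawn uniformly from [0, 3) rather than from a standard normal.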

In [3]:
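# now add anomalies to the dataset: each row is flagged as an outlier with probability outlier_prob
outlier_prob = .05
is_outlier = np.random.rand(n) < outlier_prob
n_outliers = np.sum(is_outlier)
# note: np.random.rand is uniform on [0, 1), so the outlier entries are uniform on [0, 3)
X[is_outlier] = 3 * np.random.rand(n_outliers, p)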
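
To see how the anomalous rows differ from the normal ones, it can help to compare their per-entry mean and variance. The anomaly cell calls `np.random.rand`, which samples uniformly from [0, 1), so after scaling by 3 the entries have mean 1.5 and variance 9/12 = 0.75. The sketch below checks this empirically; the sample size `m`, the seed, and the variable names are illustrative assumptions rather than values from the notebook.

```python
import numpy as np

rng = np.random.RandomState(0)  # arbitrary seed, for repeatability only
m, p = 5000, 20                 # illustrative sample size, larger than the notebook's

normal_rows = rng.randn(m, p)      # standard normal: mean 0, variance 1
anomaly_rows = 3 * rng.rand(m, p)  # uniform on [0, 3): mean 1.5, variance 0.75

print("normal  mean, var:", normal_rows.mean(), normal_rows.var())
print("anomaly mean, var:", anomaly_rows.mean(), anomaly_rows.var())
```

The anomalies are therefore separated mainly by a shift in location (every coordinate is non-negative with mean 1.5), not by a larger spread.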

We will use a random cut forest to try to detect these anomalies while making no assumptions about the structure of the data.

First, we'll fit the random cut forest in batch mode.  

In [4]:
# fit a random cut forest on the full dataset in a single batch
forest_batch = rcf.RandomCutForest(max_samples=128, random_features=False).fit(X)

In [5]:
scores_batch = forest_batch.decision_function(X)

Now, we'll fit the random cut forest in streaming mode.  Create an initial model with a small subset of the points, then stream in the remaining points.

In [6]:
# build a random cut forest with only a small sample of initial points
stream_init = 300
forest_stream = rcf.RandomCutForest(max_samples=128, random_features=False).fit(X[:stream_init])

In [7]:
# now stream in the remaining points
for i in range(stream_init, n):
    forest_stream.add_point(X[i])

In [8]:
scores_stream = forest_stream.decision_function(X)

Both models detect the anomalies well: in the run shown below, each achieves a ROC AUC above 0.98.

Note that these methods are stochastic, so the results will be different each time this notebook is run.
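
If repeatable data is wanted, one option is to seed NumPy's global generator before generating the data set. The sketch below only demonstrates that the data generation becomes deterministic. Whether the random cut forest itself draws from NumPy's global generator is not established by this notebook, so the scores may still vary between runs. The helper `make_data` and the seed value are hypothetical.

```python
import numpy as np

def make_data(seed, n=1000, p=20, outlier_prob=0.05):
    """Regenerate a data set like the notebook's under a fixed seed (hypothetical helper)."""
    np.random.seed(seed)
    X = np.random.randn(n, p)                          # normal, non-anomalous rows
    is_outlier = np.random.rand(n) < outlier_prob      # flag roughly 5% of the rows
    n_outliers = int(is_outlier.sum())
    X[is_outlier] = 3 * np.random.rand(n_outliers, p)  # overwrite flagged rows
    return X, is_outlier

# Two calls with the same seed should yield identical data and labels.
X_a, labels_a = make_data(seed=42)
X_b, labels_b = make_data(seed=42)
print(np.array_equal(X_a, X_b), np.array_equal(labels_a, labels_b))
```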

In [9]:
# both random cut forests separate the anomalies well
# the scores are negated because roc_auc_score expects larger values for the positive class (is_outlier)
print('batch random cut forest roc auc: ', metrics.roc_auc_score(is_outlier, -scores_batch))
print('streaming random cut forest roc auc: ', metrics.roc_auc_score(is_outlier, -scores_stream))

batch random cut forest roc auc:  0.9888216568248644
streaming random cut forest roc auc:  0.9956744672061432
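
ROC AUC can be read as the probability that a randomly chosen anomaly receives a higher score than a randomly chosen normal point, with ties counted as one half. The sketch below computes that quantity directly on a toy example and shows why the scores are negated when lower raw values indicate anomalies. The function `pairwise_auc` and the toy numbers are illustrative only, and this is not a description of how `metrics.roc_auc_score` is implemented internally.

```python
import numpy as np

def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels]                # scores of the anomalies
    neg = scores[~labels]               # scores of the normal points
    diff = pos[:, None] - neg[None, :]  # every positive compared with every negative
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

# Toy data: the last two points are anomalies; lower raw score = more anomalous.
toy_labels = [False, False, False, True, True]
toy_raw = [0.30, 0.10, 0.25, -0.40, 0.20]

auc_raw = pairwise_auc(toy_labels, toy_raw)
auc_negated = pairwise_auc(toy_labels, [-s for s in toy_raw])
print(auc_raw, auc_negated)
```

On this toy data the raw scores rank only 1 of the 6 anomaly/normal pairs correctly, while the negated scores rank 5 of 6 correctly, so negation turns an AUC of 1/6 into 5/6.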
