This notebook showcases a workflow for benchmarking an anomaly detection model against a dataset of time series with anomaly labels.

In [1]:
from mb_detect.dataloader import iter_reader, benchmarker
import numpy as np

Initialize the dataset reader and the model. As a simple example, we use an isolation forest from sklearn.

In [2]:
from sklearn.ensemble import IsolationForest

nab_data_dir = "./data/nab/data/"
nab_label_dir = "./data/nab/labels/"
nab_reader = iter_reader.NabIter(data_dir=nab_data_dir, label_dir=nab_label_dir)
model = IsolationForest(n_estimators=100, warm_start=False, contamination=0.05)
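
To make the model's output easier to interpret, here is a minimal sketch on synthetic data (not taken from the benchmark datasets). It relies on two documented properties of sklearn's `IsolationForest`: `fit_predict` labels inliers as `1` and outliers as `-1`, and `contamination` sets the expected proportion of outliers, which determines the decision threshold. The random data, the seed, and `random_state=0` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic 2-D data for illustration; this is not one of the benchmark series.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))

# Same hyperparameters as above, plus a fixed seed for reproducibility.
model = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
labels = model.fit_predict(X)  # 1 = inlier, -1 = outlier

# Fraction of training points flagged as outliers; it should be close to
# the contamination value because the threshold is derived from it.
flagged_fraction = float(np.mean(labels == -1))
print(f"flagged as outliers: {flagged_fraction:.3f}")
```

Because the threshold is derived from `contamination`, the model flags roughly that share of the points it was fitted on, whether or not the series actually contains anomalies.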

We iterate over the dataset with the reader. For each time series, we apply the model and calculate the F1 score. We then average the F1 scores within each category of time series in the dataset.

In [3]:
def benchmark(model, reader):
    """Print the mean F1 score of `model` for each group of time series in `reader`."""
    group_scores = {}
    for data, metadata in reader:
        pred_outliers = benchmarker.test_unsupervised(data, model, window_size=50)
        # A prediction of -1 marks an outlier; convert to a boolean mask.
        pred_outliers = pred_outliers == -1
        true_outliers = data["is_anomaly"].to_numpy()
        # Compare the predictions with the labels and score this series.
        cm = benchmarker.compare_to_labels(true_outliers, pred_outliers)
        f1_score = benchmarker.f_one_score(cm)
        # Collect the scores per group (dataset category).
        group_scores.setdefault(metadata["group"], []).append(f1_score)
    for group_name, scores in group_scores.items():
        print(group_name, np.round(np.mean(scores), 2))
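
For reference, the F1 score is the harmonic mean of precision and recall, which can be written as 2·TP / (2·TP + FP + FN). The sketch below shows this computation on a toy example. It is not the `mb_detect` implementation: `confusion_counts` and `f1_from_counts` are hypothetical helper names, and returning 0.0 when there are no true or predicted positives is an assumed convention.

```python
import numpy as np


def confusion_counts(true_outliers, pred_outliers):
    """Return (tp, fp, fn) for two boolean arrays of equal length."""
    true_outliers = np.asarray(true_outliers, dtype=bool)
    pred_outliers = np.asarray(pred_outliers, dtype=bool)
    tp = int(np.sum(true_outliers & pred_outliers))   # correctly flagged
    fp = int(np.sum(~true_outliers & pred_outliers))  # false alarms
    fn = int(np.sum(true_outliers & ~pred_outliers))  # missed anomalies
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); 0.0 if the denominator is zero (assumed)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example: sklearn-style predictions, where -1 marks an outlier.
raw_pred = np.array([1, -1, -1, 1, 1, -1])
pred_outliers = raw_pred == -1
true_outliers = np.array([0, 1, 0, 0, 1, 1], dtype=bool)

tp, fp, fn = confusion_counts(true_outliers, pred_outliers)
f1 = f1_from_counts(tp, fp, fn)
print(tp, fp, fn, round(f1, 2))
```

True negatives do not enter the formula, so the score is not inflated by the large number of normal points in a typical anomaly detection series.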

We can use this downloader if the Numenta dataset is not downloaded yet:

In [4]:
from mb_detect.dataloader import nab_downloader

nab_downloader.NabData(nab_path="./data/nab/")

nab data folder exists already .


<mb_detect.dataloader.nab_downloader.NabData at 0x7fbe7c778670>

In [5]:
benchmark(model, nab_reader)

realTweets 0.25
artificialWithAnomaly 0.29
realAWSCloudwatch 0.27
realKnownCause 0.36
realAdExchange 0.19
artificialNoAnomaly 0.2
realTraffic 0.34


We can conclude that the isolation forest does not perform sufficiently well on these rather complex problems. However, with this API we can easily exchange components of the experiment.

For example, we can replace the Numenta dataset with the Yahoo dataset. We cannot provide a downloader for this dataset because it is behind an authentication wall. It can be downloaded after requesting access [here](https://webscope.sandbox.yahoo.com/catalog.php?datatype=s&did=70).

In [6]:
yahoo_data_dir = "./data/yahoo/ydata-labeled-time-series-anomalies-v1_0/"
yahoo_reader = iter_reader.YahooIter(data_dir=yahoo_data_dir)

In [7]:
benchmark(model, yahoo_reader)

A2Benchmark 0.05
A4Benchmark 0.28
A3Benchmark 0.37
A1Benchmark 0.31
