# Anomalies collector - step by step tutorial

<a href="https://colab.research.google.com/github/netdata/netdata-community/blob/main/netdata-agent-api/netdata-pandas/anomalies_collector_tutorial.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook will walk through, step by step, a worked example of how the [netdata anomalies collector](https://github.com/andrewm4894/netdata/tree/anomalies-collector/collectors/python.d.plugin/anomalies) works under the hood. 

**Note**: you can click the "Open in Colab" button above to open this notebook in [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb#recent=true), where you can get going right away without having to set up Python environments or any messy stuff like that.

In [251]:
# uncomment the below line to install required packages if needed.
#!pip install netdata-pandas==0.0.28 numba==0.50.1 scikit-learn==0.23.2 pyod==0.8.3

## Overview

There are three main functions in the anomalies collector that do most of the work:

- make_features(): a function to take raw data and create our feature vector used in the models.
- train(): a function to take our training data and train our models, one for each chart. 
- predict(): a function to take the most recent data and run it through the trained model's predict() method to get our anomaly probability and anomaly flag.

In [252]:
import time
from datetime import datetime
import re

from IPython.display import display, Markdown
import numpy as np
import pandas as pd
from netdata_pandas.data import get_data, get_allmetrics
from pyod.models.hbos import HBOS
from pyod.models.pca import PCA
from pyod.models.cblof import CBLOF
from pyod.models.iforest import IForest


def make_features(df, lags_n, diffs_n, smooth_n):
    """Given a pandas dataframe preprocess it to take differences, add smoothing, and lags as specified. 
    """
    if diffs_n >= 1:
        df = df.diff(diffs_n).dropna()
    if smooth_n >= 2:
        df = df.rolling(smooth_n).mean().dropna()
    if lags_n >= 1:
        df_columns_new = [f'{col}_lag{n}' for n in range(lags_n+1) for col in df.columns]
        df = pd.concat([df.shift(n) for n in range(lags_n + 1)], axis=1).dropna()
        df.columns = df_columns_new
    # sort columns to have lagged values next to each other for clarity
    df = df.reindex(sorted(df.columns), axis=1)
    return df
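
To get a feel for what `make_features()` produces, here is a minimal sketch on a made-up single-column dataframe. The column name `m` and its values are purely illustrative and do not come from the collector; the function is repeated so that the snippet runs on its own.

```python
import pandas as pd


def make_features(df, lags_n, diffs_n, smooth_n):
    """Same logic as the function above, repeated so this snippet is self-contained."""
    if diffs_n >= 1:
        df = df.diff(diffs_n).dropna()
    if smooth_n >= 2:
        df = df.rolling(smooth_n).mean().dropna()
    if lags_n >= 1:
        df_columns_new = [f'{col}_lag{n}' for n in range(lags_n + 1) for col in df.columns]
        df = pd.concat([df.shift(n) for n in range(lags_n + 1)], axis=1).dropna()
        df.columns = df_columns_new
    df = df.reindex(sorted(df.columns), axis=1)
    return df


# hypothetical toy data: one metric 'm' observed at 8 consecutive time steps
df_toy = pd.DataFrame({'m': [1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0, 29.0]})

# difference once, smooth over 2 steps, keep 2 lags
features_toy = make_features(df_toy, lags_n=2, diffs_n=1, smooth_n=2)
print(features_toy)
```

The 8 input rows become 4 feature vectors with 3 columns each: one row is consumed by differencing, one by smoothing, and two by the lags.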

In the next cell we will define all the inputs we will use in this tutorial. Feel free to play with them once you are familiar with how it all hangs together. 

In [253]:
# inputs
host = 'london.my-netdata.io'
# for this tutorial we will just use two charts, and so two models
charts_in_scope = ['system.cpu', 'system.load']
model = 'iforest'
train_n_secs = 14400
contamination = 0.001
offset_n_secs = 0
lags_n = 5
smooth_n = 3
diffs_n = 1
n_prediction_steps = 20

In [254]:
# initialize a model for each chart
if model == 'pca':
    models = {c: PCA(contamination=contamination, n_components=2, n_selected_components=2) for c in charts_in_scope}
elif model == 'hbos':
    models = {c: HBOS(contamination=contamination) for c in charts_in_scope}
elif model == 'cblof':
    models = {c: CBLOF(contamination=contamination, n_clusters=4) for c in charts_in_scope}
elif model == 'iforest':
    models = {c: IForest(contamination=contamination, n_estimators=50, bootstrap=True, behaviour='new') for c in charts_in_scope}
else:
    models = {c: HBOS(contamination=contamination) for c in charts_in_scope}

## Get training data

The first thing we need to do is get our raw training data for each chart we want to build a model for.

In [255]:
# define the window for the training data to pull
before = int(datetime.now().timestamp()) - offset_n_secs
after =  before - train_n_secs

# get the training data
df_train = get_data(hosts=host, charts=charts_in_scope, after=after, before=before, sort_cols=True, numeric_only=True)
print(df_train.shape)
df_train.head()

(14400, 12)


| time_idx | system.cpu\|guest | system.cpu\|guest_nice | system.cpu\|iowait | system.cpu\|irq | system.cpu\|nice | system.cpu\|softirq | system.cpu\|steal | system.cpu\|system | system.cpu\|user | system.load\|load1 | system.load\|load15 | system.load\|load5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1603824982 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.002506 | 1.002506 | NaN | NaN | NaN |
| 1603824983 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.501253 | 0.501253 | NaN | NaN | NaN |
| 1603824984 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.251889 | 0.0 | 0.755668 | 0.755668 | NaN | NaN | NaN |
| 1603824985 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.753769 | 0.502513 | 0.03 | 0.0 | 0.01 |
| 1603824986 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.753769 | 1.005025 | 0.03 | 0.0 | 0.01 |


Above we can see our raw training data is just a pandas `DataFrame` with a timestamp index and a column for each metric from our `charts_in_scope` list. 
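
The 14400 rows follow from the window definition: with `train_n_secs = 14400` and one observation per second (as the consecutive `time_idx` values suggest), the window covers four hours. A small sketch of the same arithmetic, using a hypothetical helper `training_window()` and a fixed timestamp chosen only so that the result is reproducible:

```python
def training_window(now_ts, train_n_secs, offset_n_secs=0):
    """Return (after, before) unix timestamps for the training window.

    Hypothetical helper mirroring the two lines in the cell above.
    """
    before = int(now_ts) - offset_n_secs
    after = before - train_n_secs
    return after, before


# fixed example timestamp, chosen only for illustration
after, before = training_window(now_ts=1603839382, train_n_secs=14400, offset_n_secs=0)
print(after, before, (before - after) / 3600)
```

Setting `offset_n_secs` to a positive value moves the whole window back in time, which is one way to keep the most recent data out of the training set.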

## Preprocess training data

Before we train our models we first preprocess the raw data into a "feature vector" that gives the model a richer representation to work with than the most recently observed values in isolation. The idea is to give the model some extra context, in the form of recent differenced, smoothed, and lagged values, so that it may spot more complex and interesting anomalies rather than just spikes where one metric takes a very high or very low value.

In [256]:
df_train_processed = make_features(df_train, lags_n, diffs_n, smooth_n)
print(df_train_processed.shape)
df_train_processed.head()

(14389, 72)


| time_idx | system.cpu\|guest_lag0 | system.cpu\|guest_lag1 | system.cpu\|guest_lag2 | system.cpu\|guest_lag3 | system.cpu\|guest_lag4 | system.cpu\|guest_lag5 | system.cpu\|guest_nice_lag0 | system.cpu\|guest_nice_lag1 | system.cpu\|guest_nice_lag2 | system.cpu\|guest_nice_lag3 | ... | system.load\|load1_lag2 | system.load\|load1_lag3 | system.load\|load1_lag4 | system.load\|load1_lag5 | system.load\|load5_lag0 | system.load\|load5_lag1 | system.load\|load5_lag2 | system.load\|load5_lag3 | system.load\|load5_lag4 | system.load\|load5_lag5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1603824993 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1603824994 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1603824995 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1603824996 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1603824997 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -0.003333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [257]:
# Let's look at how the shape of our data has changed due to preprocessing
print(f'df_train shape is {df_train.shape}')
print(f'df_train_processed is {df_train_processed.shape}')
n_cols_added = len(df_train_processed.columns)-len(df_train.columns)
print(f'make_features has added {n_cols_added} new columns, one for each lags_n ({df_train.shape[1]}*{lags_n}={n_cols_added})')

df_train shape is (14400, 12)
df_train_processed is (14389, 72)
make_features has added 60 new columns, one for each lags_n (12*5=60)


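As the output above shows, preprocessing has added `lags_n` new columns for each metric, so each metric now has `lags_n + 1` columns (`lag0` to `lag5`). We have also lost 11 rows: `diffs_n`, `smooth_n`, and `lags_n` each consume rows at the start of the data, as do the first few rows where the `system.load` values are missing.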
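
The shape change can also be worked out by hand. The sketch below uses a hypothetical helper, `expected_shape()`, which is not part of the collector. It assumes that differencing costs `diffs_n` rows, smoothing costs `smooth_n - 1` rows, lagging costs `lags_n` rows, and that leading rows containing missing values are dropped as well.

```python
def expected_shape(n_rows, n_cols, lags_n, diffs_n, smooth_n, n_leading_nan_rows=0):
    """Estimate the shape make_features() returns (hypothetical helper)."""
    rows_lost = n_leading_nan_rows
    rows_lost += diffs_n if diffs_n >= 1 else 0
    rows_lost += smooth_n - 1 if smooth_n >= 2 else 0
    rows_lost += lags_n if lags_n >= 1 else 0
    # each input column becomes lag0..lagN, i.e. lags_n + 1 columns
    n_cols_out = n_cols * (lags_n + 1) if lags_n >= 1 else n_cols
    return n_rows - rows_lost, n_cols_out


# the first 3 rows of df_train above have no system.load values yet
print(expected_shape(14400, 12, lags_n=5, diffs_n=1, smooth_n=3, n_leading_nan_rows=3))
```

With the tutorial's inputs this gives 14400 - (3 + 1 + 2 + 5) = 14389 rows and 12 * 6 = 72 columns, matching the shapes printed above.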

To be super clear, let's look at the first few rows of training data for a specific metric before and after preprocessing.

In [258]:
metric = 'system.cpu|user'
print('raw data')
display(df_train[df_train.columns[df_train.columns.str.startswith(metric)]].head(1 + lags_n + smooth_n + diffs_n))
print('processed data')
display(df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(metric)]].head(1))
print('manually calculated')
display(df_train[df_train.columns[df_train.columns.str.startswith(metric)]].diff(diffs_n).dropna().rolling(smooth_n).mean().head(1 + lags_n + smooth_n + diffs_n).tail(1))

raw data

| time_idx | system.cpu\|user |
| --- | --- |
| 1603824982 | 1.002506 |
| 1603824983 | 0.501253 |
| 1603824984 | 0.755668 |
| 1603824985 | 0.502513 |
| 1603824986 | 1.005025 |
| 1603824987 | 1.507538 |
| 1603824988 | 0.75188 |
| 1603824989 | 0.253165 |
| 1603824990 | 0.498753 |
| 1603824991 | 0.503778 |

processed data

| time_idx | system.cpu\|user_lag0 | system.cpu\|user_lag1 | system.cpu\|user_lag2 | system.cpu\|user_lag3 | system.cpu\|user_lag4 | system.cpu\|user_lag5 |
| --- | --- | --- | --- | --- | --- | --- |
| 1603824993 | 0.333749 | 0.25062 | -0.0827 | -0.336262 | -0.25062 | 0.083122 |

manually calculated

| time_idx | system.cpu\|user |
| --- | --- |
| 1603824992 | 0.25062 |


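Above you can see how one raw metric is now represented as a vector of `lags_n + 1` differenced and smoothed values (`lag0` to `lag5`). Note that the manually calculated value is for timestamp 1603824992, so it shows up as `lag1` in the processed row for 1603824993. It is this matrix of smoothed differences that the model will use for both training and during a predict step.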
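
As a cross-check, the four oldest lags in the processed row can be reproduced with plain NumPy from the ten raw `system.cpu|user` values printed above. This is only a sketch: it uses array slicing and a moving average via `np.convolve` in place of the pandas calls in `make_features()`, and tiny differences arise because the printed raw values are rounded.

```python
import numpy as np

# raw system.cpu|user values for timestamps 1603824982..1603824991, copied from the output above
raw = np.array([1.002506, 0.501253, 0.755668, 0.502513, 1.005025,
                1.507538, 0.75188, 0.253165, 0.498753, 0.503778])

smooth_n = 3

# first difference, equivalent to df.diff(1).dropna() for a single column
diffs = raw[1:] - raw[:-1]

# 3-step moving average, equivalent to rolling(3).mean().dropna()
smoothed = np.convolve(diffs, np.ones(smooth_n) / smooth_n, mode='valid')

# the last four smoothed values belong to timestamps ...988 to ...991,
# i.e. lag5, lag4, lag3 and lag2 of the feature vector at 1603824993
print(np.round(smoothed[-4:], 6))
```

Read from left to right, the printed values should line up with `lag5`, `lag4`, `lag3`, and `lag2` in the processed row above.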

## Train models

Now that we have our preprocessed training data we will train a model for each chart. This feature representation describes each time step for each chart as a lagged, smoothed matrix of differences.

In [259]:
# loop over each chart in scope and train a model for each
for chart in charts_in_scope:
    X_train = df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(chart)]].values
    print(f'train model for {chart} using X_train of {X_train.shape}')
    models[chart] = models[chart].fit(X_train)

train model for system.cpu using X_train of (14389, 54)
train model for system.load using X_train of (14389, 18)
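
Selecting a chart's columns with `str.startswith(chart)` works here because no other chart in scope has a name beginning with `system.cpu` or `system.load`. If charts with a shared prefix were in scope, matching on the chart name plus the `|` separator would be safer. A minimal sketch with made-up columns (`system.cpu_extra` is a hypothetical chart name used only to show the clash):

```python
import pandas as pd

# made-up feature columns; 'system.cpu_extra' is a hypothetical chart name
df = pd.DataFrame(
    [[0.1, 0.2, 0.3, 0.4]],
    columns=['system.cpu|user_lag0', 'system.cpu|user_lag1',
             'system.cpu_extra|dim1_lag0', 'system.load|load1_lag0'],
)

chart = 'system.cpu'

# a plain prefix match also picks up the 'system.cpu_extra' column
loose = df.columns[df.columns.str.startswith(chart)]

# including the '|' separator restricts the match to this chart's dimensions
strict = df.columns[df.columns.str.startswith(chart + '|')]

print(list(loose))
print(list(strict))
```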


So we have now trained our models, one for each chart, on our preprocessed training data. To be concrete, we will look at some example observations our model has been trained on.

In [260]:
# let's look at the first matrix or "feature vector" for our first chart
obs_n = 0
print(f'timestamp={df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(charts_in_scope[0])]].index[obs_n]}')
print(f'feature vector for {obs_n}th training observation:')
print(df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(charts_in_scope[0])]].values[obs_n]) 

timestamp=1603824993
feature vector for 0th training observation:
[ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.0837521  -0.0835422  -0.0837521   0.          0.0835422
  0.08333333  0.          0.          0.          0.          0.
  0.1672901  -0.16877637  0.16876787  0.08187237 -0.0824799  -0.0841719
  0.33374897  0.25062017 -0.08270047 -0.33626153 -0.25062017  0.08312237]


In [261]:
# and the next one
obs_n = 1
print(f'timestamp={df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(charts_in_scope[0])]].index[obs_n]}')
print(f'feature vector for {obs_n}th training observation:')
print(df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(charts_in_scope[0])]].values[obs_n]) 

timestamp=1603824994
feature vector for 1th training observation:
[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  8.37521000e-02 -8.35422000e-02 -8.37521000e-02  0.00000000e+00
  0.00000000e+00  8.33333333e-02  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -8.39630667e-02  1.67290100e-01
 -1.68776367e-01  1.68767867e-01  8.18723667e-02 -8.24799000e-02
 -1.85037171e-17  3.33748967e-01  2.50620167e-01 -8.27004667e-02
 -3.36261533e-01 -2.50620167e-01]


If you look closely at the output of the above two cells you will see the same values shifted along by one position for each lag.

Each matrix of numbers above is the representation we give to our model of each timestep. This is how the model views each chart - a matrix (or "feature vector" if you want to sound fancy) of floating point numbers encoding some differenced and smoothed information about the latest observation and the `lags_n` observations before it, for each dimension in the specific chart we are modelling.
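
The shift between consecutive feature vectors is easy to verify numerically. The sketch below copies the last six values of each printed vector (the `system.cpu|user` lags 0 to 5) and checks that lags 1 to 5 of the second observation equal lags 0 to 4 of the first.

```python
import numpy as np

# system.cpu|user lag0..lag5 at 1603824993 (last six values of the first vector)
obs_0 = np.array([0.33374897, 0.25062017, -0.08270047, -0.33626153, -0.25062017, 0.08312237])

# the same six columns at 1603824994 (last six values of the second vector)
obs_1 = np.array([-1.85037171e-17, 3.33748967e-01, 2.50620167e-01,
                  -8.27004667e-02, -3.36261533e-01, -2.50620167e-01])

# lag k at time t should equal lag k-1 at time t-1
shifted_ok = np.allclose(obs_1[1:], obs_0[:-1], atol=1e-7)
print(shifted_ok)
```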

## Get prediction data

Now that we have trained models for each chart we can use them to look at incoming observations and 'ask' the trained models how 'unusual' each one looks.

In [262]:
# define an empty dataframe to hold enough recent data to generate our feature vectors from
df_recent = pd.DataFrame()
times = []
# simulate n_prediction_steps of getting latest data, making feature vector and getting predictions
for prediction_step in range(n_prediction_steps):
    time.sleep(1)
    df_latest = get_allmetrics(host=host, charts=charts_in_scope, wide=True)[df_train.columns]
    df_latest['time_idx'] = int(time.time())
    df_latest = df_latest.set_index('time_idx')
    # just keep enough recent data to generate each feature vector
    df_recent = pd.concat([df_recent, df_latest]).tail((lags_n + smooth_n + diffs_n) * 2)
    df_predict_processed = make_features(df_recent, lags_n, diffs_n, smooth_n)
print(f'we now have {df_predict_processed.shape[0]} recent preprocessed feature vectors to predict on.')

we now have 10 recent preprocessed feature vectors to predict on.
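
The `tail()` call above keeps `df_recent` at a fixed maximum size. Here is a sketch of the same rolling-buffer pattern on synthetic data, with no Netdata host involved. The column name `m`, its values, and the fake timestamps are made up, and `pd.concat` is used to add each new row.

```python
import pandas as pd

lags_n, smooth_n, diffs_n = 5, 3, 1
n_prediction_steps = 20
max_rows = (lags_n + smooth_n + diffs_n) * 2  # 18 rows, as in the cell above

df_recent = pd.DataFrame()
for step in range(n_prediction_steps):
    # stand-in for one row of freshly collected metrics, indexed by a fake timestamp
    df_latest = pd.DataFrame({'m': [float(step)]}, index=[1000 + step])
    # add the new row and drop anything older than the last max_rows rows
    df_recent = pd.concat([df_recent, df_latest]).tail(max_rows)

# preprocessing consumes diffs_n + (smooth_n - 1) + lags_n rows
n_vectors = len(df_recent) - (diffs_n + (smooth_n - 1) + lags_n)
print(len(df_recent), n_vectors)
```

After 20 steps the buffer holds the 18 most recent rows, and 18 - 8 = 10 of them yield a complete feature vector, which matches the count printed above.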


In [263]:
print(df_predict_processed.shape)
df_predict_processed.head()

(10, 72)


| time_idx | system.cpu\|guest_lag0 | system.cpu\|guest_lag1 | system.cpu\|guest_lag2 | system.cpu\|guest_lag3 | system.cpu\|guest_lag4 | system.cpu\|guest_lag5 | system.cpu\|guest_nice_lag0 | system.cpu\|guest_nice_lag1 | system.cpu\|guest_nice_lag2 | system.cpu\|guest_nice_lag3 | ... | system.load\|load1_lag2 | system.load\|load1_lag3 | system.load\|load1_lag4 | system.load\|load1_lag5 | system.load\|load5_lag0 | system.load\|load5_lag1 | system.load\|load5_lag2 | system.load\|load5_lag3 | system.load\|load5_lag4 | system.load\|load5_lag5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1603839397 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | -0.006667 | -0.006667 | 0.0 | 0.0 | 0.0 | 0.0 | -0.003333 | -0.003333 |
| 1603839398 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -0.003333 | 0.0 | 0.0 | -0.006667 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.003333 |
| 1603839399 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -0.003333 | -0.003333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1603839400 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | -0.003333 | -0.003333 | -0.003333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1603839401 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | -0.003333 | -0.003333 | -0.003333 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


## Get predictions

In [264]:
# for each recent feature vector, get a prediction
for time_idx, row in df_predict_processed.iterrows():
    print(f'\npredictions for time {time_idx}\n')
    df_tmp = row.to_frame().transpose()
    # loop variable is named `chart` so it does not overwrite the `model` input defined earlier
    for chart in models:
        X_predict = df_tmp[df_tmp.columns[df_tmp.columns.str.startswith(chart)]].values
        anomaly_probability = round(models[chart].predict_proba(X_predict)[-1][1],4)
        anomaly_flag = models[chart].predict(X_predict)[-1]
        print(f'model={chart}, anomaly_probability={anomaly_probability}, anomaly_flag={anomaly_flag}')


predictions for time 1603839397

model=system.cpu, anomaly_probability=0.104, anomaly_flag=0
model=system.load, anomaly_probability=0.1714, anomaly_flag=0

predictions for time 1603839398

model=system.cpu, anomaly_probability=0.0652, anomaly_flag=0
model=system.load, anomaly_probability=0.1246, anomaly_flag=0

predictions for time 1603839399

model=system.cpu, anomaly_probability=0.0803, anomaly_flag=0
model=system.load, anomaly_probability=0.0378, anomaly_flag=0

predictions for time 1603839400

model=system.cpu, anomaly_probability=0.2049, anomaly_flag=0
model=system.load, anomaly_probability=0.0246, anomaly_flag=0

predictions for time 1603839401

model=system.cpu, anomaly_probability=0.226, anomaly_flag=0
model=system.load, anomaly_probability=0.0442, anomaly_flag=0

predictions for time 1603839402

model=system.cpu, anomaly_probability=0.2711, anomaly_flag=0
model=system.load, anomaly_probability=0.0807, anomaly_flag=0

predictions for time 1603839404

model=system.cpu, anomaly_

## But what _is_ the model?

To try and lift the lid a little on what the model actually is and how it is calculating anomaly probabilities, let's take a look at one trained model.

In [265]:
chart = charts_in_scope[0]
print(f'model for chart {chart}:')
models[chart].__dict__

model for chart system.cpu:


{'contamination': 0.001,
 'n_estimators': 50,
 'max_samples': 'auto',
 'max_features': 1.0,
 'bootstrap': True,
 'n_jobs': 1,
 'behaviour': 'new',
 'random_state': None,
 'verbose': 0,
 '_classes': 2,
 'detector_': IsolationForest(bootstrap=True, contamination=0.001, n_estimators=50, n_jobs=1),
 'decision_scores_': array([-0.11336808, -0.09236182, -0.06955371, ..., -0.11016781,
        -0.07864415, -0.13539968]),
 'threshold_': 4.1805480518325444e-16,
 'labels_': array([0, 0, 0, ..., 0, 0, 0]),
 '_mu': -0.14197090869713713,
 '_sigma': 0.030401437580786624}
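
A few of these attributes can be related to each other with a small NumPy sketch. It reflects our understanding of PyOD's base detector, so treat it as an assumption rather than the library's exact code: `threshold_` is taken as the `100 * (1 - contamination)` percentile of the training `decision_scores_`, `labels_` flags training scores above that threshold, `_mu` and `_sigma` are the mean and standard deviation of those scores, and the default `predict_proba()` rescales a new score linearly between the minimum and maximum training score, clipped to [0, 1]. The scores below are synthetic, loosely mirroring the `_mu` and `_sigma` values shown above.

```python
import numpy as np

rng = np.random.default_rng(42)
contamination = 0.001

# synthetic stand-in for the decision_scores_ of a fitted model (higher = more anomalous)
decision_scores = rng.normal(loc=-0.14, scale=0.03, size=14389)

# assumed: the threshold is the (1 - contamination) percentile of the training scores
threshold = np.percentile(decision_scores, 100 * (1 - contamination))
labels = (decision_scores > threshold).astype(int)
mu, sigma = decision_scores.mean(), decision_scores.std()


def linear_probability(score, train_scores):
    """Assumed 'linear' conversion: min-max scale against training scores, clip to [0, 1]."""
    lo, hi = train_scores.min(), train_scores.max()
    return float(np.clip((score - lo) / (hi - lo), 0.0, 1.0))


print(f'threshold={threshold:.4f}, flagged={labels.sum()} of {labels.size}')
print(f'mu={mu:.4f}, sigma={sigma:.4f}')
print(f'probability at the threshold: {linear_probability(threshold, decision_scores):.4f}')
```

Under these assumptions only about 0.1% of the training observations end up flagged, which is why a small `contamination` makes the anomaly flag fire rarely even when the anomaly probability moves around quite a bit.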