# Anomalies collector - step by step tutorial

<a href="https://colab.research.google.com/github/andrewm4894/netdata-community/blob/main/netdata-agent-api/netdata-pandas/anomalies_collector_tutorial.ipynb" target="_blank"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook will walk through, step by step, a worked example of how the [netdata anomalies collector](https://github.com/andrewm4894/netdata/tree/anomalies-collector/collectors/python.d.plugin/anomalies) works under the hood. 

**Note**: you can click the "Open in Colab" button above to open this notebook in [Google Colab](https://colab.research.google.com/notebooks/intro.ipynb#recent=true), where you can get going straight away without having to set up Python environments or any messy stuff like that.

In [1]:
# uncomment the below line to install required packages if needed.
#!pip install netdata-pandas==0.0.28 numba==0.50.1 scikit-learn==0.23.2 pyod==0.8.3

## Overview

There are three main concepts central to what the anomalies collector does:

- **featurization**: This is how we take the raw data for each chart and preprocess it into a feature representation or "[feature vector](https://en.wikipedia.org/wiki/Feature_(machine_learning))" used by the model. A simple way to think of this is that we just take each row of data and add some extra columns to encode some additional information, for example the last `lags_n` smoothed values for each dimension on the chart, so the model can have some knowledge of the recent past beyond just the latest raw values of the dimensions on the chart.
- **training**: A function to take our "featurized" training data and train our models, one for each chart. This function will do slightly different things depending on which model you use, but in a broad sense its job is to train a model that gets good at 'reconstructing' the featurized training data from itself. Other models take a slightly different approach: instead of trying to reconstruct the training data, they learn a function that gives a measure of surprise for each feature vector. For our purposes this is largely abstracted away by the API of the PyOD library, so as a user we can easily switch between various models and still have broadly the same inputs and outputs.
- **prediction**: Each trained model then has a predict() function that we can use by passing in a new feature vector and getting back an anomaly probability and anomaly flag from the trained model. This is the part where we actually use the trained model and as new data arrives we basically ask it - "how unusual does this feature vector look to you?"  
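
To make the three stages concrete, here is a minimal sketch using a toy stand-in model rather than PyOD. The `ToyDetector` class, its distance-from-the-mean scoring, and the `load1` data are all assumptions made purely for illustration; the collector itself uses PyOD models.

```python
import numpy as np
import pandas as pd


class ToyDetector:
    """Hypothetical stand-in for a PyOD model: scores rows by distance from the training mean."""

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        scores = np.linalg.norm(X - self.mean_, axis=1)
        # flag anything further out than the largest distance seen in training
        self.threshold_ = scores.max()
        return self

    def predict(self, X):
        scores = np.linalg.norm(X - self.mean_, axis=1)
        return (scores > self.threshold_).astype(int)


# featurization: latest value plus one lagged value per dimension
raw = pd.DataFrame({'load1': [1.0, 1.1, 0.9, 1.0, 1.2, 1.0]})
features = pd.concat([raw, raw.shift(1).add_suffix('_lag1')], axis=1).dropna()

# training: fit the detector on the featurized data
detector = ToyDetector().fit(features.values)

# prediction: ask the detector about a normal and an unusual feature vector
normal_flag = detector.predict(np.array([[1.0, 1.1]]))[0]
unusual_flag = detector.predict(np.array([[9.0, 1.0]]))[0]
print(normal_flag, unusual_flag)
```

The same featurize, fit, predict shape is what the rest of this notebook follows, only with richer features and real PyOD models.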

## Let's go!

In [2]:
import time
from datetime import datetime
import re

from IPython.display import display, Markdown
import numpy as np
import pandas as pd
from netdata_pandas.data import get_data, get_allmetrics
from pyod.models.hbos import HBOS
from pyod.models.pca import PCA
from pyod.models.cblof import CBLOF
from pyod.models.iforest import IForest


def make_features(df, lags_n, diffs_n, smooth_n):
    """Given a pandas dataframe preprocess it to take differences, add smoothing, and lags as specified. 
    """
    if diffs_n >= 1:
        # take differences
        df = df.diff(diffs_n).dropna()
    if smooth_n >= 2:
        # apply a rolling average to smooth out the data a bit
        df = df.rolling(smooth_n).mean().dropna()
    if lags_n >= 1:
        # for each dimension add a new column for each of lags_n lags of the differenced and smoothed values for that dimension
        df_columns_new = [f'{col}_lag{n}' for n in range(lags_n+1) for col in df.columns]
        df = pd.concat([df.shift(n) for n in range(lags_n + 1)], axis=1).dropna()
        df.columns = df_columns_new
    # sort columns to have lagged values next to each other for clarity when looking at the feature vectors
    df = df.reindex(sorted(df.columns), axis=1)
    
    return df


## Inputs & configuration

In the next cell we will define all the inputs we will use in this tutorial. Feel free to play with them once you are familiar with how it all hangs together.

Below you will see that the parameter values map to a subset of the inputs (the most important ones that will help explain what's going on) required as part of the [`anomalies.conf`](https://github.com/andrewm4894/netdata/blob/anomalies-collector/collectors/python.d.plugin/anomalies/anomalies.conf) configuration for the anomalies collector itself.

In [3]:
# inputs

# what host will we use
host = 'london.my-netdata.io'
# for this tutorial we will just use four charts, and so four models
charts_in_scope = ['system.cpu', 'system.load', 'system.net', 'system.io']
# what model from PyOD will we use under the hood
model = 'pca'
# how many seconds of data will we train our models on
train_n_secs = 14400
# what contamination rate will we use, see some discussion here to understand this one more: https://github.com/yzhao062/pyod/issues/144
contamination = 0.001
# if we want to ignore a recent window of data when training the model we can use this
offset_n_secs = 0
# how many lags to include in our feature vector
lags_n = 5
# how much smoothing to apply in our feature vector
smooth_n = 3
# if we want to do everything in terms of differences then we set diffs_n=1
diffs_n = 1
# for the purpose of this tutorial how many prediction steps will we take once we have a trained model
n_prediction_steps = 20

Now we will initialize a PyOD model for each chart in `charts_in_scope`. Each model in PyOD has various input parameters that a user can play with. We tend to use the defaults and sometimes override them with values we have picked based on what we know about the task we are working on. Generally these model parameters, apart from `contamination`, are hardcoded into the anomalies collector based on our internal research as we developed the collector. You can see this in the [collector code here](https://github.com/andrewm4894/netdata/blob/anomalies-collector/collectors/python.d.plugin/anomalies/anomalies.chart.py#L77).

In the cell below we have added a comment for the source and API reference of each model from PyOD so you can take a look and read more about each one.

By default the anomalies collector uses the `PCA` model. This is primarily because PCA offers a good combination of being able to capture and model flexible patterns in the data while also being computationally fast, since under the hood it uses the well researched, optimized and understood [SVD](https://en.wikipedia.org/wiki/Singular_value_decomposition) algorithm to decompose our featurized data and project it onto a lower dimensional space. At a high level, when we see new data that is in a strange or unexpected part of this lower dimensional space, this is symptomatic of anomalous data and so it will get a higher anomaly score.

- api: https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.pca
- source: https://pyod.readthedocs.io/en/latest/_modules/pyod/models/pca.html
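
As a rough illustration of the idea, the sketch below uses plain NumPy SVD to find two principal directions in some toy data and then scores rows by how badly they are reconstructed from those directions. This is a simplification and an assumption on our part: PyOD's `PCA` uses its own scoring rule, and the data, the choice of 2 components, and the `reconstruction_error` helper are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy training data: 2 informative directions embedded in 6 features plus small noise
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 6))
X_train = latent @ mixing + 0.05 * rng.normal(size=(500, 6))

# standardize, then decompose with SVD
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
Z = (X_train - mean) / std
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
components = Vt[:2]  # keep the top 2 principal directions


def reconstruction_error(X):
    """Distance between each standardized row and its projection onto the kept components."""
    Zx = (X - mean) / std
    projected = Zx @ components.T @ components
    return np.linalg.norm(Zx - projected, axis=1)


typical = X_train[:1]
# push the same row well away from the structure seen in training
strange = typical + np.array([[5.0, -5.0, 5.0, -5.0, 5.0, -5.0]]) * std
print(reconstruction_error(typical), reconstruction_error(strange))
```

A row that follows the structure seen in training is reconstructed almost perfectly, while the perturbed row is left with a much larger error.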

In [4]:
# initialize a model for each chart
if model == 'pca':
    # api: https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.pca
    # source: https://pyod.readthedocs.io/en/latest/_modules/pyod/models/pca.html
    models = {c: PCA(contamination=contamination, n_components=2, n_selected_components=2) for c in charts_in_scope}
elif model == 'hbos':
    # api: https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.hbos
    # source: https://pyod.readthedocs.io/en/latest/_modules/pyod/models/hbos.html
    models = {c: HBOS(contamination=contamination) for c in charts_in_scope}
elif model == 'cblof':
    # api: https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.cblof
    # source: https://pyod.readthedocs.io/en/latest/_modules/pyod/models/cblof.html
    models = {c: CBLOF(contamination=contamination, n_clusters=4) for c in charts_in_scope}
elif model == 'iforest':
    # api: https://pyod.readthedocs.io/en/latest/pyod.models.html#module-pyod.models.iforest
    # source: https://pyod.readthedocs.io/en/latest/_modules/pyod/models/iforest.html
    models = {c: IForest(contamination=contamination, n_estimators=50, bootstrap=True, behaviour='new') for c in charts_in_scope}
else:
    # fall back to HBOS if the model name is not recognised, as it is both fast and robust to many different types of data and has proven in internal development
    # to have fewer failure modes than some other models given the wide variety of data we are expecting to be thrown at it
    models = {c: HBOS(contamination=contamination) for c in charts_in_scope}

## Get training data

The first thing we need to do is get our raw training data for each chart we want to build a model for.

To get the data we will make use of the [netdata-pandas](https://github.com/netdata/netdata-pandas) library, which we built to make multiple asynchronous calls to the [Netdata REST API](https://learn.netdata.cloud/docs/agent/web/api) and wrangle the results into a nice [Pandas](https://pandas.pydata.org/) [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

In [5]:
# define the window for the training data to pull
before = int(datetime.now().timestamp()) - offset_n_secs
after =  before - train_n_secs

# get the training data
df_train = get_data(hosts=host, charts=charts_in_scope, after=after, before=before, sort_cols=True, numeric_only=True, float_size='float32')
print(df_train.info())
df_train.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 14399 entries, 1603872875 to 1603887273
Data columns (total 16 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   system.cpu|guest       14399 non-null  float32
 1   system.cpu|guest_nice  14399 non-null  float32
 2   system.cpu|iowait      14399 non-null  float32
 3   system.cpu|irq         14399 non-null  float32
 4   system.cpu|nice        14399 non-null  float32
 5   system.cpu|softirq     14399 non-null  float32
 6   system.cpu|steal       14399 non-null  float32
 7   system.cpu|system      14399 non-null  float32
 8   system.cpu|user        14399 non-null  float32
 9   system.io|in           14399 non-null  float32
 10  system.io|out          14399 non-null  float32
 11  system.load|load1      14399 non-null  float32
 12  system.load|load15     14399 non-null  float32
 13  system.load|load5      14399 non-null  float32
 14  system.net|received    14399 non-null  f

time_idx,system.cpu|guest,system.cpu|guest_nice,system.cpu|iowait,system.cpu|irq,system.cpu|nice,system.cpu|softirq,system.cpu|steal,system.cpu|system,system.cpu|user,system.io|in,system.io|out,system.load|load1,system.load|load15,system.load|load5,system.net|received,system.net|sent
1603872875,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.501253,0.501253,0.0,0.0,0.05,0.03,0.07,58.23505,-55.117802
1603872876,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.755668,0.251889,0.0,0.0,0.05,0.03,0.07,58.105171,-156.052414
1603872877,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.755668,0.755668,0.0,0.0,0.05,0.03,0.07,199.398605,-99.251968
1603872878,0.0,0.0,0.0,0.0,0.0,0.250627,0.0,0.75188,0.501253,0.0,-48.289742,0.05,0.03,0.07,155.327377,-78.049698
1603872879,0.0,0.0,0.0,0.0,0.0,0.251889,0.0,0.503778,1.007557,0.0,-23.71026,0.05,0.03,0.07,113.740067,-110.700722


Above we can see our raw training data is just a pandas `DataFrame` with a timestamp index and a column for each dimension from our `charts_in_scope` list.

**Note**: The [netdata-pandas](https://github.com/netdata/netdata-pandas) default naming convention for columns is "chart.name|dimension.name" 
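
Because of this convention, the chart a column belongs to can be recovered by splitting on the pipe. The sketch below uses a handful of column names from the output above; the `dims_by_chart` dictionary is just an illustrative name.

```python
import pandas as pd

columns = pd.Index(['system.cpu|user', 'system.cpu|system', 'system.load|load1', 'system.net|sent'])

# split each "chart|dimension" name on the pipe and group dimensions by chart
dims_by_chart = {}
for col in columns:
    chart, dim = col.split('|', 1)
    dims_by_chart.setdefault(chart, []).append(dim)

print(dims_by_chart)

# selecting one chart's columns by prefix; including the trailing pipe avoids
# accidentally matching another chart whose name merely starts with 'system.cpu'
cpu_cols = columns[columns.str.startswith('system.cpu|')]
print(list(cpu_cols))
```

The notebook selects columns with `startswith(chart)`, which works for the four charts used here since none of their names is a prefix of another.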

## Preprocess training data

Before we train our model we will first do some preprocessing to the raw data to create a "feature vector" to try and encode a more flexible and powerful representation for the model to work with as opposed to just looking at the most recently observed values in isolation. 

This is the "featurization" we mentioned at the beginning of the notebook. The idea here is to give the model some extra information so that it may spot more complex and interesting anomalies as opposed to just spikes where one metric is a very high or very low value.

In [6]:
# lets preprocess or "featurize" our raw data
df_train_processed = make_features(df_train, lags_n, diffs_n, smooth_n)

# print out the shape of our featurized data
print(df_train_processed.shape)
df_train_processed.head()

(14391, 96)


time_idx,system.cpu|guest_lag0,system.cpu|guest_lag1,system.cpu|guest_lag2,system.cpu|guest_lag3,system.cpu|guest_lag4,system.cpu|guest_lag5,system.cpu|guest_nice_lag0,system.cpu|guest_nice_lag1,system.cpu|guest_nice_lag2,system.cpu|guest_nice_lag3,...,system.net|received_lag2,system.net|received_lag3,system.net|received_lag4,system.net|received_lag5,system.net|sent_lag0,system.net|sent_lag1,system.net|sent_lag2,system.net|sent_lag3,system.net|sent_lag4,system.net|sent_lag5
1603872883,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-33.064599,-31.864746,18.544963,32.364106,23.657443,23.487494,6.468403,-3.333796,15.117231,-7.643967
1603872884,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,29.220114,-33.064599,-31.864746,18.544963,-29.390568,23.657443,23.487494,6.468403,-3.333796,15.117231
1603872885,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,2.987481,29.220114,-33.064599,-31.864746,-17.817933,-29.390568,23.657443,23.487494,6.468403,-3.333796
1603872886,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-2.315852,2.987481,29.220114,-33.064599,4.513982,-17.817933,-29.390568,23.657443,23.487494,6.468403
1603872887,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-50.601482,-2.315852,2.987481,29.220114,37.946377,4.513982,-17.817933,-29.390568,23.657443,23.487494


The below few cells will explore a little what we have just done to try and make the ideas of preprocessing aka "featurization" aka "feature vector" a little clearer.

Terms like "featurization" and "feature vector" are often used to sound fancy, but in reality it's typically just as simple as adding additional columns to each row of your data, where those new columns have numbers in them that represent something about your data that you want to make available to the model.
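
Here is a small worked example of those extra columns on a toy series, following the same difference, smooth, lag order as `make_features` but with `diffs_n=1`, `smooth_n=2` and `lags_n=1`. The column name and values are made up for illustration.

```python
import pandas as pd

# toy single-dimension series; the column name is made up for illustration
df = pd.DataFrame({'chart|dim': [1.0, 2.0, 4.0, 7.0, 11.0, 16.0]})

diffed = df.diff(1).dropna()                   # 1, 2, 3, 4, 5
smoothed = diffed.rolling(2).mean().dropna()   # 1.5, 2.5, 3.5, 4.5
lagged = pd.concat([smoothed.shift(n) for n in range(2)], axis=1).dropna()
lagged.columns = ['chart|dim_lag0', 'chart|dim_lag1']
print(lagged)
```

Each remaining row now carries the latest smoothed difference (`lag0`) next to the previous one (`lag1`), so three rows of two features are left from the six raw values.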

So in our case, adding lagged values of each smoothed and differenced dimension is basically a design choice: we are telling the model we want it to consider `lags_n` recent values as opposed to just the latest observed values. We do this because there are many [different types of anomalies](https://andrewm4894.com/2020/10/19/different-types-of-time-series-anomalies/) we want to be able to spot, so making a small snippet of recent data for each dimension available to the model gives us the ability to capture more complex anomaly patterns that might happen.

If we were to just train the model on the most recent values for each dimension, the best we could reasonably hope for it to capture would be anomalies where one or more dimensions take an unusually high or low value for one time step. This is essentially not that much better than a traditional approach using z-scores. (If you are interested in comparing the two, we also have a [zscores collector](https://github.com/andrewm4894/netdata/tree/zscores-collector/collectors/python.d.plugin/zscores) on the way, for example if you would like to start simple or cannot install the ML Python libraries the anomalies collector depends on.)
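
For reference, a z-score check on a single value can be sketched as below. This is a generic illustration on synthetic data, not the zscores collector's implementation; the `history` array and the threshold of 3 are assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
history = rng.normal(loc=10.0, scale=1.0, size=1000)  # toy metric history

mean, std = history.mean(), history.std()


def zscore(value):
    """How many standard deviations a single value sits from the historical mean."""
    return (value - mean) / std


# a value near the mean versus an obvious spike
print(round(zscore(10.5), 2), round(zscore(25.0), 2))
print(abs(zscore(25.0)) > 3)
```

A z-score only ever looks at one value at a time, so it cannot see an unusual pattern spread across several recent time steps, which is what the lagged features are for.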

In [7]:
# Lets look at how the shape of our data has changed due to preprocessing
print(f'df_train shape is {df_train.shape}')
print(f'df_train_processed is {df_train_processed.shape}')
n_cols_added = len(df_train_processed.columns)-len(df_train.columns)
print(f'make_features has added {n_cols_added} new columns, one for each lags_n ({df_train.shape[1]}*{lags_n}={n_cols_added})')

df_train shape is (14399, 16)
df_train_processed is (14391, 96)
make_features has added 80 new columns, one for each lags_n (16*5=80)


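So as you can see from the above output, our featurization has added `lags_n` new lagged columns for each of the original columns. We have also lost a few rows at the start of the data, since `diffs_n`, `smooth_n` and `lags_n` each need some history before they can produce a value.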
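
The shapes above can be checked with a little arithmetic: differencing drops `diffs_n` rows, the rolling mean drops `smooth_n - 1` rows, and lagging drops `lags_n` rows. The numbers below are taken from the configuration and output in this notebook.

```python
# values taken from the configuration and output above
n_rows_raw, n_dims = 14399, 16
lags_n, smooth_n, diffs_n = 5, 3, 1

rows_lost = diffs_n + (smooth_n - 1) + lags_n
n_rows_featurized = n_rows_raw - rows_lost
n_cols_featurized = n_dims * (lags_n + 1)
print(rows_lost, n_rows_featurized, n_cols_featurized)  # 8 14391 96
```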

To be super clear lets look at the first few rows of training data for a specific metric before and after preprocessing. 

**Note**: Look at the last `time_idx` to see how the featurization works for a specific timestamp of data.

In [8]:
metric = 'system.cpu|user'
print('raw data')
display(df_train[df_train.columns[df_train.columns.str.startswith(metric)]].head(3 + lags_n + smooth_n + diffs_n))

raw data


time_idx,system.cpu|user
1603872875,0.501253
1603872876,0.251889
1603872877,0.755668
1603872878,0.501253
1603872879,1.007557
1603872880,0.757576
1603872881,0.503778
1603872882,0.50505
1603872883,1.503759
1603872884,1.002506


In [9]:
print('featurized data')
display(df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(metric)]].head(1))

featurized data


time_idx,system.cpu|user_lag0,system.cpu|user_lag1,system.cpu|user_lag2,system.cpu|user_lag3,system.cpu|user_lag4,system.cpu|user_lag5
1603872883,0.248728,-0.167502,0.000842,0.000636,0.251889,9.934107e-09


In [10]:
print('manually calculated')
# here we take differences and smooth values for one specific dimension's latest value e.g. lag0
# the same calculation is done for each lag, obviously just in a shifted manner
# after differencing, the first featurized timestamp is row (lags_n + smooth_n), so this should match lag0 above
display(
    df_train[df_train.columns[df_train.columns.str.startswith(metric)]]\
    .diff(diffs_n).dropna()\
    .rolling(smooth_n).mean()\
    .head(lags_n + smooth_n).tail(1)
)

manually calculated


time_idx,system.cpu|user
1603872883,0.248728


Above you can see how one raw metric value is now preprocessed into a vector of `lags_n + 1` differenced and smoothed values (the latest value plus `lags_n` lags). It is this matrix of smoothed differences that the model will use for both training and during a predict step.

So, for example, if a chart has 3 dimensions and we have set `lags_n` to be 5 then our featurized 'matrix' of numbers will be a 3*(1+5) matrix. In reality this matrix is just flattened into a feature vector of 3 * (1+5) = 18 floating point values. The cell below shows this for the `system.load` chart as that is an example with 3 dimensions. 
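
The flattening itself is nothing more than a reshape. The sketch below uses placeholder values to show a 3 x (1+5) matrix becoming a single row of 18 features.

```python
import numpy as np

n_dims, lags_n = 3, 5

# one row per dimension, one column per lag (lag0 .. lag5); values are placeholders
matrix = np.arange(n_dims * (lags_n + 1), dtype=float).reshape(n_dims, lags_n + 1)

# flattening row by row keeps each dimension's lags next to each other
feature_vector = matrix.reshape(1, -1)
print(matrix.shape, feature_vector.shape)  # (3, 6) (1, 18)
```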

In [11]:
# lets look at our first feature vector for the 'system.load' model 
print(df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith('system.load')]].head(1).shape)
print(df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith('system.load')]].head(1).values)

(1, 18)
[[ 0.          0.          0.          0.          0.          0.
   0.         -0.00333333 -0.00333333 -0.00333333  0.          0.
   0.          0.          0.          0.          0.          0.        ]]


## Train models

Now that we have our preprocessed training data we will train a model for each chart. The featurized data represents each time step of a chart as a differenced, smoothed, and lagged matrix.

In [12]:
# loop over each chart in scope and train a model for each
for chart in charts_in_scope:
    # pull out the columns relating to the chart based on what their name starts with and put them into a numpy array of values
    X_train = df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(chart)]].values
    print(f'train model for {chart} using X_train of {X_train.shape}')
    # call the fit() method on each initialized model and pass it the full numpy array of our featurized training data
    models[chart] = models[chart].fit(X_train)

train model for system.cpu using X_train of (14391, 54)
train model for system.load using X_train of (14391, 18)
train model for system.net using X_train of (14391, 12)
train model for system.io using X_train of (14391, 12)


So we have now trained our models, one for each chart, based on our preprocessed training data. To be concrete we will look at some example observations our model has been trained on.

In [13]:
# let's look at the first matrix or "feature vector" for our first chart and so our first model
obs_n = 0
model_n = 0
print(f'timestamp={df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(charts_in_scope[model_n])]].index[obs_n]}')
print(f'feature vector for {obs_n}th training observation for {charts_in_scope[model_n]} model:')
print(df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(charts_in_scope[model_n])]].values[obs_n]) 

timestamp=1603872883
feature vector for 0th training observation for system.cpu model:
[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00 -8.41750999e-02 -8.39630663e-02
 -8.35421979e-02  8.41750999e-02  8.39630663e-02  8.35421979e-02
  0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
  0.00000000e+00  0.00000000e+00  2.49360780e-01  4.24067179e-04
 -8.27004711e-02 -8.35390091e-02 -8.39630763e-02  8.35421880e-02
  2.48727858e-01 -1.67502065e-01  8.41716925e-04  6.36100769e-04
  2.51889169e-01  9.93410746e-09]


In [14]:
# and the next one
obs_n = 1
model_n = 0
print(f'timestamp={df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(charts_in_scope[model_n])]].index[obs_n]}')
print(f'feature vector for {obs_n}th training observation for {charts_in_scope[model_n]} model:')
print(df_train_processed[df_train_processed.columns[df_train_processed.columns.str.startswith(charts_in_scope[model_n])]].values[obs_n]) 

timestamp=1603872884
feature vector for 1th training observation for system.cpu model:
[ 0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.          0.          0.          0.          0.          0.
  0.         -0.0841751  -0.08396307 -0.0835422   0.0841751   0.08396307
  0.          0.          0.          0.          0.          0.
 -0.00084172  0.24936078  0.00042407 -0.08270047 -0.08353901 -0.08396308
  0.16624266  0.24872786 -0.16750207  0.00084172  0.0006361   0.25188917]


If you look closely at the above two cells you will see the same values shifted along by one position for each lag.

Each matrix of numbers above is the representation we give to our model of each timestep. This is how the model views each chart - a matrix (or "feature vector" if you want to sound fancy) of floating point numbers encoding some differenced and smoothed information about the last `lags_n` observations for each dimension in the specific chart we are modelling. 

**Note**: Within the anomalies collector, at some regular interval, as defined by `train_every_n` in the `anomalies.conf` file, we will repeat the above training step to essentially retrain all models on the most recent window of available training data. 
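
A retraining schedule of this kind can be sketched with a simple step counter. The values and the loop below are hypothetical and only illustrate the idea; the collector's actual scheduling logic may differ.

```python
# hypothetical values for illustration; see anomalies.conf for the real settings
train_every_n = 5
n_steps = 12

retrained_at = []
for step in range(1, n_steps + 1):
    if step % train_every_n == 0:
        # this is where fresh training data would be pulled and fit() called again
        retrained_at.append(step)
    # every step would still featurize the latest data and call predict()

print(retrained_at)  # [5, 10]
```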

## Get prediction data

Now that we have a trained model for each chart we can use them on incoming observations and 'ask' each trained model how 'unusual' it thinks they are.

In [15]:
# define an empty dataframe to hold enough recent data to generate our feature vectors from
df_recent = pd.DataFrame()
times = []

# simulate n_prediction_steps of getting the latest data and making feature vectors ready for prediction
for prediction_step in range(n_prediction_steps):
    time.sleep(1)
    df_latest = get_allmetrics(host=host, charts=charts_in_scope, wide=True)[df_train.columns]
    df_latest['time_idx'] = int(time.time())
    df_latest = df_latest.set_index('time_idx')
    # just keep enough recent data to generate each feature vector
    df_recent = pd.concat([df_recent, df_latest]).tail((lags_n + smooth_n + diffs_n) * 2)
    
    # now lets featurize our recent data to be able to get predictions from the model for each observation
    df_predict_processed = make_features(df_recent, lags_n, diffs_n, smooth_n)

print(f'we now have {df_predict_processed.shape[0]} recent preprocessed feature vectors to predict on.')

we now have 10 recent preprocessed feature vectors to predict on.


In [16]:
print(df_predict_processed.shape)
df_predict_processed.head()

(10, 96)


time_idx,system.cpu|guest_lag0,system.cpu|guest_lag1,system.cpu|guest_lag2,system.cpu|guest_lag3,system.cpu|guest_lag4,system.cpu|guest_lag5,system.cpu|guest_nice_lag0,system.cpu|guest_nice_lag1,system.cpu|guest_nice_lag2,system.cpu|guest_nice_lag3,...,system.net|received_lag2,system.net|received_lag3,system.net|received_lag4,system.net|received_lag5,system.net|sent_lag0,system.net|sent_lag1,system.net|sent_lag2,system.net|sent_lag3,system.net|sent_lag4,system.net|sent_lag5
1603887288,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,4.067934,24.261251,43.957575,-71.901983,44.71267,-30.4595,-28.263083,-34.664116,-38.637143,33.655889
1603887289,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-45.323688,4.067934,24.261251,43.957575,40.852385,44.71267,-30.4595,-28.263083,-34.664116,-38.637143
1603887290,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,28.951936,-45.323688,4.067934,24.261251,23.38636,40.852385,44.71267,-30.4595,-28.263083,-34.664116
1603887291,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,27.406867,28.951936,-45.323688,4.067934,-80.917204,23.38636,40.852385,44.71267,-30.4595,-28.263083
1603887293,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-6.052823,27.406867,28.951936,-45.323688,-26.372788,-80.917204,23.38636,40.852385,44.71267,-30.4595


The above featurized prediction data should be identical in terms of structure and schema to the featurized training data we explored above. This is what is expected by the model.
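
One simple way to guard against a schema mismatch is to compare column names and order before predicting. The two small frames below are stand-ins for the featurized training and prediction data, with made-up values.

```python
import pandas as pd

# stand-ins for the featurized training and prediction frames
train_cols = ['system.load|load1_lag0', 'system.load|load1_lag1']
df_train_like = pd.DataFrame([[0.1, 0.2]], columns=train_cols)
df_predict_like = pd.DataFrame([[0.3, 0.1]], columns=train_cols)

# both the names and their order must match what the model was trained on
same_schema = list(df_train_like.columns) == list(df_predict_like.columns)
print(same_schema)  # True
```

In this notebook the match is ensured by selecting `df_train.columns` when pulling the latest data and by running the same `make_features` function on both frames.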

## Get predictions

In [17]:
# for each recent feature vector, get a prediction
for time_idx, row in df_predict_processed.iterrows():
    
    print(f'\npredictions for time {time_idx}\n')
    
    # convert our row into the expected 'flattened' feature vector
    df_tmp = row.to_frame().transpose()
    
    # the loop variable is named `chart` so it does not overwrite the `model` config string set earlier
    for chart in models:
        
        # pull out relevant array of features for the chart's model
        X_predict = df_tmp[df_tmp.columns[df_tmp.columns.str.startswith(chart)]].values
        
        # call the predict_proba() and predict() methods on the trained model in order to make a prediction
        anomaly_probability = round(models[chart].predict_proba(X_predict)[-1][1],4)
        anomaly_flag = models[chart].predict(X_predict)[-1]
        
        print(f'model={chart}, anomaly_probability={anomaly_probability}, anomaly_flag={anomaly_flag}')



predictions for time 1603887288

model=system.cpu, anomaly_probability=0.0674, anomaly_flag=0
model=system.load, anomaly_probability=0.0005, anomaly_flag=0
model=system.net, anomaly_probability=0.068, anomaly_flag=0
model=system.io, anomaly_probability=0.0293, anomaly_flag=0

predictions for time 1603887289

model=system.cpu, anomaly_probability=0.0723, anomaly_flag=0
model=system.load, anomaly_probability=0.0005, anomaly_flag=0
model=system.net, anomaly_probability=0.0615, anomaly_flag=0
model=system.io, anomaly_probability=0.0265, anomaly_flag=0

predictions for time 1603887290

model=system.cpu, anomaly_probability=0.079, anomaly_flag=0
model=system.load, anomaly_probability=0.0005, anomaly_flag=0
model=system.net, anomaly_probability=0.0517, anomaly_flag=0
model=system.io, anomaly_probability=0.0215, anomaly_flag=0

predictions for time 1603887291

model=system.cpu, anomaly_probability=0.0615, anomaly_flag=0
model=system.load, anomaly_probability=0.0005, anomaly_flag=0
model=syste

In the above you will probably see generally low `anomaly_probability` values, assuming nothing has blown up on the host you used between the time you ran the training cells above and the predictions above.

Let's do one last little thing to try to show what is going on here and why we put so much effort and focus into the featurization above.

We will take the last feature vector we predicted on for each model, randomly shuffle the values around so as to make an unusual looking observation, and see what sort of anomaly probability that gives us. (hint: it should be higher than those above :) ).

In [18]:
df_predict_shuffled = df_predict_processed.tail(1).transpose().sample(frac=1).transpose()
# reassign the original column names so the shuffled values land under different columns
df_predict_shuffled.columns = df_predict_processed.columns
# the loop variable is named `chart` so it does not overwrite the `model` config string set earlier
for chart in models:
    X_predict = df_predict_shuffled[df_predict_shuffled.columns[df_predict_shuffled.columns.str.startswith(chart)]].values
    anomaly_probability = round(models[chart].predict_proba(X_predict)[-1][1],4)
    anomaly_flag = models[chart].predict(X_predict)[-1]
    print(f'model={chart}, anomaly_probability={anomaly_probability}, anomaly_flag={anomaly_flag}')

model=system.cpu, anomaly_probability=1.0, anomaly_flag=1
model=system.load, anomaly_probability=1.0, anomaly_flag=1
model=system.net, anomaly_probability=0.0, anomaly_flag=0
model=system.io, anomaly_probability=1.0, anomaly_flag=1


## But what _is_ the model?

To try and lift the lid a little on how the model calculates anomaly probabilities, let's take a look inside one trained model.

In [19]:
chart = charts_in_scope[0]
print(f'model for chart {chart}:')
models[chart].__dict__

model for chart system.cpu:


{'contamination': 0.001,
 'n_components': 2,
 'n_selected_components': 2,
 'copy': True,
 'whiten': False,
 'svd_solver': 'auto',
 'tol': 0.0,
 'iterated_power': 'auto',
 'random_state': None,
 'weighted': True,
 'standardization': True,
 '_classes': 2,
 'scaler_': StandardScaler(),
 'detector_': PCA(n_components=2),
 'n_components_': 2,
 'components_': array([[ 4.08292445e-18,  9.21720327e-19,  1.78582782e-18,
          1.46173196e-19, -5.87271463e-19, -4.31308482e-18,
          7.24701690e-19, -5.05469846e-19, -1.07446078e-19,
         -1.15569631e-19, -1.36097217e-20, -0.00000000e+00,
         -3.12382980e-03,  6.81363698e-02,  3.53673187e-02,
         -1.63879621e-02, -8.26808793e-02, -2.63260556e-02,
         -0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
         -0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
         -0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
         -0.00000000e+00, -0.00000000e+00, -0.00000000e+00,
         -6.59145320e-02, -2.79651279e-03,  6
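
The attributes above (`scaler_`, `detector_`, `components_`, `weighted`) hint at how a raw score is produced. The sketch below is our reading of the idea, not PyOD's actual code: it assumes the data is standardized first, that the score is the sum of Euclidean distances from each row to the selected principal component vectors, and that each distance is divided by that component's explained variance ratio when `weighted=True`. The data and the `pca_score` helper are made up for illustration; the library source remains the authoritative version.

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up training data: 500 rows, 3 strongly correlated features
base = rng.normal(size=(500, 1))
X_train = np.hstack([base + 0.1 * rng.normal(size=(500, 1)) for _ in range(3)])

# 1. standardize each feature (what `standardization=True` suggests)
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
Z_train = (X_train - mu) / sigma

# 2. principal axes of the standardized data via SVD
n_components = 2
_, s, vt = np.linalg.svd(Z_train, full_matrices=False)
components = vt[:n_components]
explained_variance_ratio = (s ** 2 / (s ** 2).sum())[:n_components]


def pca_score(X):
    """Assumed scoring rule: weighted sum of distances to the component vectors."""
    Z = (np.atleast_2d(X) - mu) / sigma
    # distance from every row to every component vector, shape (n_rows, n_components)
    dists = np.linalg.norm(Z[:, None, :] - components[None, :, :], axis=2)
    return (dists / explained_variance_ratio).sum(axis=1)


typical = pca_score(X_train).max()
# a point far from the training data, with a sign pattern the data never shows
odd = pca_score(mu + 50 * sigma * np.array([1.0, -1.0, 1.0]))[0]
print(f'largest training score={typical:.1f}, odd point score={odd:.1f}')
```

Under these assumptions a row that sits far away from where the training data lives ends up far from every component vector, so its score is much larger than anything seen in training.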

In [20]:
help(models[chart].predict)

Help on method predict in module pyod.models.base:

predict(X) method of pyod.models.pca.PCA instance
    Predict if a particular sample is an outlier or not.
    
    Parameters
    ----------
    X : numpy array of shape (n_samples, n_features)
        The input samples.
    
    Returns
    -------
    outlier_labels : numpy array of shape (n_samples,)
        For each observation, tells whether or not
        it should be considered as an outlier according to the
        fitted model. 0 stands for inliers and 1 for outliers.
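
The docstring above says `predict` returns 0 or 1, but not how the cut-off is chosen. Here is a rough sketch, assuming (as we understand PyOD's base detector) that the threshold is the `100 * (1 - contamination)` percentile of the training scores and that anything scoring above it is flagged. The scores and the `predict_labels` helper are made up for illustration.

```python
import numpy as np

contamination = 0.001  # same value as in the model's __dict__ above

rng = np.random.default_rng(42)
# made-up raw anomaly scores for 10,000 training rows
train_scores = rng.exponential(scale=1.0, size=10_000)

# assumed rule: threshold is the (1 - contamination) percentile of training scores
threshold = np.percentile(train_scores, 100 * (1 - contamination))


def predict_labels(scores, threshold):
    """Flag a row as an outlier (1) when its score exceeds the threshold."""
    return (np.asarray(scores) > threshold).astype(int)


new_scores = np.array([0.5, threshold + 1.0])
labels = predict_labels(new_scores, threshold)
n_flagged_in_training = int(predict_labels(train_scores, threshold).sum())

print(f'threshold={threshold:.3f}, labels={labels.tolist()}')
print(f'training rows flagged={n_flagged_in_training} of {len(train_scores)}')
```

With `contamination=0.001`, roughly 0.1% of the training rows end up above the threshold, which is why a small contamination value makes the flag conservative.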



In [21]:
help(models[chart].predict_proba)

Help on method predict_proba in module pyod.models.base:

predict_proba(X, method='linear') method of pyod.models.pca.PCA instance
    Predict the probability of a sample being outlier. Two approaches
    are possible:
    
    1. simply use Min-max conversion to linearly transform the outlier
       scores into the range of [0,1]. The model must be
       fitted first.
    2. use unifying scores, see :cite:`kriegel2011interpreting`.
    
    Parameters
    ----------
    X : numpy array of shape (n_samples, n_features)
        The input samples.
    
    method : str, optional (default='linear')
        probability conversion method. It must be one of
        'linear' or 'unify'.
    
    Returns
    -------
    outlier_probability : numpy array of shape (n_samples,)
        For each observation, tells whether or not
        it should be considered as an outlier according to the
        fitted model. Return the outlier probability, ranging
        in [0,1].
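
Approach 1 above, the default `method='linear'`, can be sketched in a few lines. This is a simplified illustration rather than the library's code: it assumes the min and max are taken from the training scores and that the result is clipped to `[0, 1]`. The `linear_outlier_probability` helper and the scores are hypothetical.

```python
import numpy as np


def linear_outlier_probability(train_scores, test_scores):
    """Min-max scale test scores using the range of the training scores."""
    train_scores = np.asarray(train_scores, dtype=float)
    test_scores = np.asarray(test_scores, dtype=float)
    lo, hi = train_scores.min(), train_scores.max()
    scaled = (test_scores - lo) / (hi - lo)
    # scores outside the training range are clipped to 0 or 1
    return np.clip(scaled, 0.0, 1.0)


train_scores = [1.0, 2.0, 3.0, 5.0]
probs = linear_outlier_probability(train_scores, [0.5, 3.0, 9.0])
print(probs.tolist())  # [0.0, 0.5, 1.0]
```

If the library clips in the same way, any new score above the largest training score maps to exactly 1.0 and any score below the smallest maps to exactly 0.0, which would be consistent with the 1.0 and 0.0 probabilities printed for the shuffled vectors earlier.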



In [22]:
PCA.fit??

Signature: PCA.fit(self, X, y=None)
Source:   
    def fit(self, X, y=None):
        """Fit detector. y is ignored in unsupervised methods.

        Parameters
        ----------
        X : numpy array of shape (n_samples, n_features)
            The input samples.

        y : Ignored
            Not used, present for API consistency by convention.

        Returns
        -------
        self : object
            Fitted estimator.
        """
        # validate inputs X and y (optional)
        X = check_array(X)
        self._set_n_classes(

In [23]:
PCA.predict??

Signature: PCA.predict(self, X)
Source:   
    def predict(self, X):
        """Predict if a particular sample is an outlier or not.

        Parameters
        ----------
        X : numpy array of shape (n_samples, n_features)
            The input samples.

        Returns
        -------
        outlier_labels : numpy array of shape (n_samples,)
            For each observation, tells whether or not
            it should be considered as an outlier according to the
            fitted model. 0 stands for inliers and 1 for outliers.
        """

        check_is_fitted(self, ['decision_scores_', 'threshold_', 'labels_'])

In [24]:
PCA.decision_function??

Signature: PCA.decision_function(self, X)
Source:   
    def decision_function(self, X):
        """Predict raw anomaly score of X using the fitted detector.

        The anomaly score of an input sample is computed based on different
        detector algorithms. For consistency, outliers are assigned with
        larger anomaly scores.

        Parameters
        ----------
        X : numpy array of shape (n_samples, n_features)
            The training input samples. Sparse matrices are accepted only
            if they are supported by the base estimator.

        Returns
        -------
        anomaly_scores : numpy array of shape (n_samples,)
            The anomaly score of the input samples.
        """
        check_is_fitted(