# Agenda

1. Detection

    * Different tasks/taxonomy
    * Conceptual model
    * Interface

2. `skchange` algorithm framework

    * Interval scoring
    * Search

3. Interval scoring

    * Costs
    * Change scores
    * Anomaly scores

4. Change detection

5. Segment anomaly detection

6. Air handling unit data example

## The ~~annotation~~ detection module
<img src="img/annotation_tree.png" width="800">

Experimental module, still under heavy development.

Expect some discrepancies between `sktime` and `skchange` for the time being.

Contributions appreciated!

# Detection tasks

1. Change detection
2. Segmentation
3. Point anomaly/outlier detection
4. Segment anomaly detection


## Change detection and segmentation

In [306]:
# Import packages used throughout the notebook
import numpy as np
import pandas as pd
import plotly.express as px

from utils import (
    plot_multivariate_time_series,
    add_changepoint_vlines,
    add_segmentation_vrects,
    add_subset_segment_anomaly_vrects,
)

base_width = 800
base_height = 400

In [307]:
from skchange.datasets.generate import generate_changing_data

n = 310
cpts = [50, 150]
means = [0.0, -3, 2.0]
df = generate_changing_data(n, cpts, means, random_state=3)

cpt_fig = plot_multivariate_time_series(df)
cpt_fig = add_changepoint_vlines(cpt_fig, cpts)
for i, cpt in enumerate(cpts):
    cpt_fig.add_annotation(
        x=cpt,
        y=-0.13,
        text=f"changepoint {i+1}",
        showarrow=False,
        yshift=-10,
        font=dict(size=16),
        xref="x",
        yref="paper",
    )
cpt_fig.update_layout(
    showlegend=False, xaxis_title=None, width=base_width, height=base_height
)
cpt_fig.show()

Change detection: Detect points in time where the data generating process changes significantly.

In [308]:
extended_cpts = [0] + cpts + [n]
segments = pd.Series(
    [
        pd.Interval(extended_cpts[i], extended_cpts[i + 1], closed="left")
        for i in range(len(extended_cpts) - 1)
    ]
)
segment_fig = plot_multivariate_time_series(df)
segment_fig = add_segmentation_vrects(segment_fig, segments)
for i, segment in enumerate(segments):
    segment_fig.add_annotation(
        x=segment.mid,
        y=-0.13,
        text=f"segment {i}",
        showarrow=False,
        yshift=-10,
        font=dict(size=16),
        xref="x",
        yref="paper",
    )
segment_fig.update_layout(
    showlegend=False, xaxis_title=None, width=base_width, height=base_height
)
segment_fig.show()

Segmentation: Divide the data into segments based on certain criteria. The same label can be applied to multiple disconnected segments.

Closely related to change detection, but extra information can be present in the labels.

### Use cases

* Data cleaning: Remove segments that are not relevant for the analysis.
* Preprocessing: Divide the data into homogeneous parts for individual analysis.
* Detect interesting patterns: Anomaly detection, motif discovery, state transitions.

## Point and segment anomaly detection

In [309]:
df = generate_changing_data(n)
outliers = [60, 238, 290]
df.iloc[outliers] = np.random.uniform(4, 8, (len(outliers), df.shape[1]))
outlier_plot = plot_multivariate_time_series(df)
outlier_plot.add_scatter(
    x=outliers,
    y=df.iloc[outliers, 0],
    mode="markers",
    marker=dict(symbol="x", size=10, color="red"),
    name="Point anomaly",
).update_layout(width=base_width, height=base_height)
outlier_plot.show()

Point anomaly detection: Detect individual data points that are significantly different from the rest of the data.

In [310]:
from skchange.datasets.generate import generate_anomalous_data

anomalies = [
    (80, 100),
    (200, 300),
]
means = [5.0, 2.0]
df = generate_anomalous_data(n, anomalies, means, random_state=8)
anomaly_plot = plot_multivariate_time_series(df)
segments = pd.Series([pd.Interval(*anomaly, closed="left") for anomaly in anomalies])
anomaly_plot = add_segmentation_vrects(anomaly_plot, segments, ["red"])
for i, segment in enumerate(segments):
    anomaly_plot.add_annotation(
        x=segment.mid,
        y=-0.13,
        text=f"Segment anomaly {i+1}",
        showarrow=False,
        yshift=-10,
        font=dict(size=16),
        xref="x",
        yref="paper",
    )
anomaly_plot.update_layout(
    showlegend=False, xaxis_title=None, width=base_width, height=base_height
)
anomaly_plot

Segment anomaly detection: Detect segments of data that are significantly different from the rest of the data.

### Use cases

* Data cleaning: Remove anomalies from the data.
* Detect interesting events: Fault detection, fraud detection, etc.

# Detector conceptual model

1. Set hyperparameters
2. Fit the detector to training data
3. Detect events on new data

    * Input: A time series
    * Output: Detected events.

        - Change points
        - Segments
        - Point anomalies
        - Segment anomalies
        - Change intervals
        - ... 

# Detector interface

* `__init__(self, ...)`

    - Set hyperparameters.
* `fit(self, X, y=None)`

    - Fit the detector to the training data.
* `predict(self, X)`

    - Detect events on new data. 
    - Sparse format: One entry per detected event.
* `transform(self, X)`

    - Detect events on new data. 
    - Dense format: One entry per input time point.
    - Default: Run `predict` + sparse to dense conversion.
* `transform_scores(self, X)` [optional]

    - Return detection scores for each time point.
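
The difference between the sparse `predict` format and the dense `transform` format can be sketched as below. The helper names `changepoints_to_segment_labels` and `anomalies_to_labels` are hypothetical and written only for illustration; the actual conversion inside `sktime`/`skchange` may differ in details.

```python
import numpy as np
import pandas as pd


def changepoints_to_segment_labels(changepoints, n):
    """Dense labels: the number of changepoints at or before each time point."""
    labels = np.searchsorted(np.asarray(changepoints), np.arange(n), side="right")
    return pd.Series(labels, name="segment_label")


def anomalies_to_labels(anomalies, n):
    """Dense labels: 0 for normal points, i for points in the i-th anomaly [start, end)."""
    labels = np.zeros(n, dtype=int)
    for i, (start, end) in enumerate(anomalies, start=1):
        labels[start:end] = i
    return pd.Series(labels, name="anomaly_label")


# Sparse input: one entry per detected event.
segment_labels = changepoints_to_segment_labels([100, 120, 250], n=300)
anomaly_labels = anomalies_to_labels([(100, 120), (250, 300)], n=300)

# Dense output: one entry per input time point.
print(segment_labels.value_counts().sort_index())
print(anomaly_labels.value_counts().sort_index())
```

The sparse format is compact and convenient for plotting events, while the dense format aligns with the input index and is convenient for joins and group-by operations.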

# `skchange` algorithm framework

* All detectors in `skchange` are ___search methods___ composed of an ___interval scorer___. 

    - A similar pattern to `ruptures`, but more general and performance-oriented.
* `sktime` contains algorithms that do not follow this pattern.

### Interval scorer

- An abstraction for unifying evaluation of cost functions and statistical tests for change and anomaly detection.
- Role: Compute scores efficiently for many data cuts.
- __Cuts__: Intervals, intervals with a split point, intervals with an inner interval, ...

### Search method

- Detects events by optimizing the scores from the interval scorer.
- Role: Decide which cuts to evaluate and how to convert the resulting scores into detected events.
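
As a minimal sketch of this split of responsibilities, the toy example below pairs a scorer (a CUSUM-type difference in means) with the simplest possible search: evaluate every admissible split of the full interval and report the best one if it exceeds a threshold. The function names, the threshold value and the minimum segment length are assumptions made for illustration, not `skchange` API.

```python
import numpy as np


def cusum_score(x, start, split, end):
    """Interval scorer: scaled absolute difference in mean across the split."""
    n_left, n_right = split - start, end - split
    weight = np.sqrt(n_left * n_right / (n_left + n_right))
    return weight * abs(x[start:split].mean() - x[split:end].mean())


def detect_single_change(x, threshold, min_segment_length=5):
    """Search method: score all (0, split, n) cuts and threshold the maximum."""
    n = len(x)
    splits = np.arange(min_segment_length, n - min_segment_length + 1)
    scores = np.array([cusum_score(x, 0, split, n) for split in splits])
    best = int(np.argmax(scores))
    if scores[best] > threshold:
        return int(splits[best]), float(scores[best])
    return None, float(scores[best])


rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(5.0, 1.0, 50)])
changepoint, score = detect_single_change(x, threshold=3.0)
print(changepoint, score)
```

Search methods for multiple changes differ mainly in which cuts they choose to evaluate and how they combine the resulting scores.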


# Interval scoring

## Costs

Generate some Gaussian toy data with a single changepoint.

In [311]:
from skchange.datasets import generate_alternating_data

change_point = 50
single_cpt_df = generate_alternating_data(
    n_segments=2, segment_length=change_point, mean=5, random_state=0
)
single_cpt_df

|     | 0        |
|-----|----------|
| 0   | 1.764052 |
| 1   | 0.400157 |
| 2   | 0.978738 |
| 3   | 2.240893 |
| 4   | 1.867558 |
| ... | ...      |
| 95  | 5.706573 |
| 96  | 5.010500 |
| 97  | 6.785870 |
| 98  | 5.126912 |


In [312]:
px.line(single_cpt_df).update_layout(
    width=base_width, height=base_height, showlegend=False
)

Costs are evaluated over interval cuts. 

Fit and evaluate the L2 cost for a constant mean model.

In [313]:
from skchange.costs import L2Cost

cost = L2Cost()  # L2 cost function for a constant mean.
cost.fit(single_cpt_df)  # Precomputes sums and sums of squares.

# cuts = [start, end] for costs.
interval_cuts = [[0, 50], [25, 75], [50, 100]]
cost_values = cost.evaluate(interval_cuts)  # Uses precomputed sums + numba to evaluate.
cost_values

array([[ 63.34009223],
       [343.56245089],
       [ 37.59049314]])

The lower the cost, the better the fit. The cost is high for the middle interval `[25, 75]` because it contains the change point at index 50, so a single constant mean fits poorly.
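
The "precompute sums, then evaluate many cuts cheaply" idea mentioned in the code comments can be sketched in plain NumPy. This is an illustration that assumes the L2 cost is the sum of squared deviations from the interval mean; `fit_l2_cost` and `evaluate_l2_cost` are hypothetical names, and the sketch leaves out `numba` and multivariate input.

```python
import numpy as np


def fit_l2_cost(x):
    """Precompute cumulative sums and sums of squares, padded with a leading zero."""
    x = np.asarray(x, dtype=float)
    sums = np.concatenate(([0.0], np.cumsum(x)))
    sums_sq = np.concatenate(([0.0], np.cumsum(x**2)))
    return sums, sums_sq


def evaluate_l2_cost(sums, sums_sq, cuts):
    """Cost of each [start, end) cut in O(1): sum(x^2) - sum(x)^2 / n."""
    cuts = np.asarray(cuts)
    start, end = cuts[:, 0], cuts[:, 1]
    n = end - start
    interval_sum = sums[end] - sums[start]
    interval_sum_sq = sums_sq[end] - sums_sq[start]
    return interval_sum_sq - interval_sum**2 / n


rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(5.0, 1.0, 50)])
sums, sums_sq = fit_l2_cost(x)

interval_cuts = [[0, 50], [25, 75], [50, 100]]
costs = evaluate_l2_cost(sums, sums_sq, interval_cuts)

# Reference: the same cost computed directly from the definition.
direct = np.array([np.sum((x[s:e] - x[s:e].mean()) ** 2) for s, e in interval_cuts])
print(costs)
```

After the one-off precomputation, each cut costs a constant number of operations regardless of its length, which is what makes it feasible to score many cuts.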

Three costs are currently available:

* `L2Cost`: L2 cost for a constant mean model.
* `GaussianVarCost`: Log-likelihood cost for a univariate Gaussian model.
* `GaussianCovCost`: Log-likelihood cost for a multivariate Gaussian model.

Plenty more to come in the future.

## Change scores

Change scores are statistical tests that quantify the evidence for a single change in the data.

They are evaluated over `(start, split, end)` cuts, where the data subsets `X[start:split]` and `X[split:end]` are compared.

All costs can be used to construct a change score:
```
score.evaluate([start, split, end]) = cost.evaluate([start, end]) - cost.evaluate([start, split]) - cost.evaluate([split, end])
```
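
The construction above can be tried out on toy data with a hand-written cost. The functions `l2_cost` and `change_score` below are hypothetical stand-ins written for this sketch, not the `skchange` classes, and the L2 cost is again assumed to be the sum of squared deviations from the interval mean.

```python
import numpy as np


def l2_cost(x, start, end):
    """Sum of squared deviations from the mean of x[start:end]."""
    segment = x[start:end]
    return float(np.sum((segment - segment.mean()) ** 2))


def change_score(cost, x, start, split, end):
    """Cost of one joint fit minus the cost of separate fits on each side of the split."""
    return cost(x, start, end) - cost(x, start, split) - cost(x, split, end)


rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(5.0, 1.0, 50)])

cuts = [(0, 25, 50), (25, 50, 75), (50, 75, 100)]
scores = [change_score(l2_cost, x, *cut) for cut in cuts]
print(scores)
```

Because `change_score` only calls `cost`, any other cost with the same `(x, start, end)` signature could be plugged in without changing the construction.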

Fit and evaluate the change score constructed from the L2 cost.

In [314]:
from skchange.change_scores import ChangeScore
from skchange.costs import L2Cost

change_score = ChangeScore(cost=L2Cost())
change_score.fit(single_cpt_df)
interval_split_cuts = [[0, 25, 50], [25, 50, 75], [50, 75, 100]]
change_score_values = change_score.evaluate(interval_split_cuts)
change_score_values

array([[  5.59834604],
       [301.44887152],
       [  3.44050646]])

The higher the score, the more evidence for a change. 

High score for the second cut [25, 50, 75] because of the true change point at index 50.

There is also support for change scores that cannot be formulated in terms of costs, and for cases where computational efficiency can be gained by calculating the change score directly.

In [315]:
from skchange.change_scores import CUSUM

# The CUSUM is the most famous change point test.
# It is a direct computation of the square root of the change score based on the L2 cost.
change_score = CUSUM()
change_score.fit(single_cpt_df)
change_score.evaluate(interval_split_cuts) ** 2

array([[  5.59834604],
       [301.44887152],
       [  3.44050646]])
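
Why squaring the CUSUM reproduces the L2 change score follows from the usual decomposition of a total sum of squares into within-group and between-group parts. The notation is introduced here for illustration only: $C(a, b)$ is the L2 cost of `X[a:b]` for a univariate series and $\bar{x}_{a:b}$ is its mean. The exact scaling used inside `skchange` is assumed to match, as the identical outputs above suggest.

$$
\begin{aligned}
C(s, e) &= \sum_{t=s}^{e-1} \left(x_t - \bar{x}_{s:e}\right)^2, \\
C(s, e) - C(s, k) - C(k, e) &= \frac{(k - s)(e - k)}{e - s} \left(\bar{x}_{s:k} - \bar{x}_{k:e}\right)^2, \\
\mathrm{CUSUM}(s, k, e) &= \sqrt{\frac{(k - s)(e - k)}{e - s}} \, \left|\bar{x}_{s:k} - \bar{x}_{k:e}\right|.
\end{aligned}
$$

The direct form needs only the two segment means, so it avoids evaluating three separate costs per cut.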


# Change detection

Let's generate a slightly more complicated data set to detect change points in.

In [316]:
from skchange.datasets.generate import generate_anomalous_data

# Generate data
n = 300
anomalies = [
    (100, 120),
    (250, 300),
]
means = [
    np.array([8.0, 0.0, 0.0]),
    np.array([2.0, 3.0, 5.0]),
]
df_3d_2anomalies = generate_anomalous_data(
    n, anomalies=anomalies, means=means, random_state=3
)
# df.index = pd.date_range(start="2024-11-01", periods=n, freq="h")
df_3d_2anomalies

|     | 0         | 1         | 2         |
|-----|-----------|-----------|-----------|
| 0   | 1.788628  | 0.436510  | 0.096497  |
| 1   | -1.863493 | -0.277388 | -0.354759 |
| 2   | -0.082741 | -0.627001 | -0.043818 |
| 3   | -0.477218 | -1.313865 | 0.884622  |
| 4   | 0.881318  | 1.709573  | 0.050034  |
| ... | ...       | ...       | ...       |
| 295 | 1.181978  | 3.571703  | 6.375051  |
| 296 | 2.403536  | 2.262568  | 4.251524  |
| 297 | 3.365367  | 2.201864  | 5.089317  |
| 298 | 1.741105  | 3.053021  | 6.096369  |


In [317]:
fig_3d = plot_multivariate_time_series(df_3d_2anomalies).update_layout(
    width=base_width, height=base_height
)

fig_3d.show()

In [318]:
from skchange.change_detectors import SeededBinarySegmentation

change_score = CUSUM()
change_detector = SeededBinarySegmentation(
    change_score,
    threshold_scale=1.0,
    min_segment_length=5,
    max_interval_length=200,
)
changepoints = change_detector.fit_predict(df_3d_2anomalies)
changepoint_labels = change_detector.transform(df_3d_2anomalies)
print(changepoints)
print("============================================")
print(changepoint_labels)

0    100
1    120
2    250
Name: changepoint, dtype: int64
0      0
1      0
2      0
3      0
4      0
      ..
295    3
296    3
297    3
298    3
299    3
Name: segment_label, Length: 300, dtype: int64


In [319]:
fig_3d_with_changepoints = add_changepoint_vlines(fig_3d, changepoints)
fig_3d_with_changepoints.show()

In [320]:
segments = (
    changepoint_labels.to_frame()
    .groupby("segment_label")
    .apply(lambda x: pd.Interval(x.index[0], x.index[-1] + 1), include_groups=False)
)
add_segmentation_vrects(fig_3d, segments)

# Segment anomaly detection

## Anomaly scores

[TODO: Place here or in interval scorer section after change scores?]

Two types of anomaly scores are currently supported:

* __Saving__: The difference in cost between a fixed baseline parameter and an optimal parameter.

    - A global anomaly score. 
    - Assumes the fixed parameter is estimated robustly over the entire time series.
* __Local anomaly score__: The difference in cost between an inner interval (the anomaly) and an outer interval.

We only cover the saving here.

### Fixed cost

* Optimised/estimated parameter: `BaseCost(param=None)`
* Fixed parameter: `BaseCost(param: float|np.ndarray)`

In [321]:
from skchange.costs import L2Cost

baseline_cost = L2Cost(param=0)  # fixed mean = 0
baseline_cost.fit(single_cpt_df)
baseline_cost.evaluate(interval_cuts)

array([[  64.32793769],
       [ 599.24590043],
       [1277.14080349]])

In [322]:
from skchange.anomaly_scores import Saving

saving = Saving(baseline_cost=baseline_cost)
saving.fit(single_cpt_df)
saving.evaluate(interval_cuts)

array([[9.87845452e-01],
       [2.55683450e+02],
       [1.23955031e+03]])

* There's little value in fitting a mean other than 0 in the first interval.
* There's high value in fitting a mean other than 0 in the second and third intervals.
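
The saving can be checked by hand from the outputs already shown: it is the fixed-parameter baseline cost minus the cost with an optimised parameter. The arrays below are copied from the `L2Cost(param=0)` and `L2Cost()` outputs in this notebook; the subtraction is a sketch of the definition, not a claim about how `Saving` is implemented internally.

```python
import numpy as np

# Costs for the cuts [0, 50], [25, 75], [50, 100], copied from the cell outputs.
baseline_costs = np.array([64.32793769, 599.24590043, 1277.14080349])  # mean fixed at 0
optimised_costs = np.array([63.34009223, 343.56245089, 37.59049314])  # mean estimated

# Saving: how much the cost drops when the mean is allowed to differ from the baseline.
savings = baseline_costs - optimised_costs
print(savings)
```

The result agrees with the `saving.evaluate(interval_cuts)` output up to rounding of the printed values.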

In [323]:
px.line(single_cpt_df).update_layout(
    width=base_width, height=base_height, showlegend=False
)

## Anomaly detection

In [324]:
from skchange.anomaly_detectors import CAPA

anomaly_detector = CAPA(saving, collective_penalty_scale=1.0)
anomalies = anomaly_detector.fit_predict(df_3d_2anomalies)
anomaly_labels = anomaly_detector.transform(df_3d_2anomalies)
print(anomalies)
print("============================================")
print(anomaly_labels)

0    [100, 120)
1    [250, 300)
Name: anomaly_interval, dtype: interval
0      0
1      0
2      0
3      0
4      0
      ..
295    2
296    2
297    2
298    2
299    2
Name: anomaly_label, Length: 300, dtype: int64


In [325]:
add_segmentation_vrects(fig_3d, anomalies, colors=["red"])

## Subset anomaly detection

Root cause analysis: Detect which subset of the variables (columns) is affected by each anomaly.

In [326]:
from skchange.anomaly_detectors import MVCAPA

subset_anomaly_detector = MVCAPA()
subset_anomalies = subset_anomaly_detector.fit_predict(df_3d_2anomalies)
subset_anomaly_labels = subset_anomaly_detector.transform(df_3d_2anomalies)
print(subset_anomalies)
print("============================================")
print(subset_anomaly_labels)

  anomaly_interval anomaly_columns
0       [100, 120)             [0]
1       [250, 300)       [2, 1, 0]
     0  1  2
0    0  0  0
1    0  0  0
2    0  0  0
3    0  0  0
4    0  0  0
..  .. .. ..
295  2  2  2
296  2  2  2
297  2  2  2
298  2  2  2
299  2  2  2

[300 rows x 3 columns]


In [327]:
add_subset_segment_anomaly_vrects(fig_3d, subset_anomalies)

# Air handling unit dataset

* Vibration magnitude data from an air handling unit.
* Data from the company Soundsensing in Norway.
* A research project on using vibration and sound sensors to detect anomalies in technical equipment in commercial buildings.

Task: Detect when the machine switches between states.

The particular air handling unit in this dataset operates in three different states: off, on, and an intermediate state.


In [328]:
data_folder = "../data/"
df_air_handling = pd.read_csv(data_folder + "air_handling_unit.csv").set_index("time")
df_air_handling.index = pd.to_datetime(df_air_handling.index)
df_air_handling

| time                      | vibration_magnitude |
|---------------------------|---------------------|
| 2023-03-18 06:00:00+00:00 | 0.469595            |
| 2023-03-18 06:10:00+00:00 | 0.465891            |
| 2023-03-18 06:20:00+00:00 | 0.423907            |
| 2023-03-18 06:30:00+00:00 | 0.419382            |
| 2023-03-18 06:40:00+00:00 | 0.416293            |
| ...                       | ...                 |
| 2023-04-17 05:20:00+00:00 | 0.444172            |
| 2023-04-17 05:30:00+00:00 | 0.438997            |
| 2023-04-17 05:40:00+00:00 | 0.435317            |
| 2023-04-17 05:50:00+00:00 | 0.433883            |


In [329]:
vib_fig_width = 1.5 * base_width
vib_fig_height = 1.2 * base_height

vib_fig = px.line(df_air_handling["vibration_magnitude"]).update_layout(
    width=vib_fig_width, height=vib_fig_height
)
vib_fig.show()

In [330]:
# For the default penalty values to work, the data should be standardized with respect to the within-segment standard deviation.
# This can be a bit hard to get right.
# Either estimate scale well or adjust the penalty.

# Standard deviation or MAD of the differenced data is a common way of estimating the scale.
# This fails here, however.
df_diff = df_air_handling.diff().dropna()
differenced_mad = (df_diff - df_diff.median()).abs().quantile(0.9)
scale = 1.4826 * differenced_mad / np.sqrt(2)
# df["standardized_vibration_magnitude"] = (df - df.median()) / scale

df_air_handling["standardized_vibration_magnitude"] = (
    (df_air_handling - df_air_handling.mean()) / df_diff.std() / np.sqrt(2)
)

# Use regular standardization instead.
# df["standardized_vibration_magnitude"] = (df - df.mean()) / df.std()
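
The idea behind the differencing-based scale estimate in the comments above can be illustrated on synthetic data. This is a sketch under the assumption of independent Gaussian noise around a piecewise-constant mean, which is the setting where the estimate works; as noted in the comments, it does not work well on the air handling unit data.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
# Piecewise-constant mean with one large level shift, plus Gaussian noise.
x = np.concatenate([rng.normal(0.0, sigma, 1000), rng.normal(5.0, sigma, 1000)])

# Naive estimate: inflated because the level shift counts as variability.
naive_std = x.std()

# Differencing removes the piecewise-constant mean everywhere except at changepoints.
# For independent noise, Var(x[t] - x[t-1]) = 2 * sigma^2, hence the division by sqrt(2).
diff = np.diff(x)
mad = np.median(np.abs(diff - np.median(diff)))
# 1.4826 makes the MAD consistent with the standard deviation under normality.
robust_sigma = 1.4826 * mad / np.sqrt(2)

print(f"naive std: {naive_std:.2f}, robust estimate: {robust_sigma:.2f}")
```

The MAD is insensitive to the few large differences that occur at changepoints, which is why it is preferred over the plain standard deviation of the differences when changes are large.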

In [331]:
standardized_vib_fig = px.line(
    df_air_handling["standardized_vibration_magnitude"]
).update_layout(width=vib_fig_width, height=vib_fig_height)
standardized_vib_fig.show()

In [332]:
from skchange.change_detectors import PELT
from skchange.costs import L2Cost

change_detector = PELT(cost=L2Cost(), penalty_scale=0.8)

x = df_air_handling["standardized_vibration_magnitude"]
change_detector.fit(x)
changepoints = change_detector.predict(x)
changepoints

0        22
1        50
2        66
3       132
4       156
       ... 
114    4158
115    4182
116    4212
117    4236
118    4302
Name: changepoint, Length: 119, dtype: int64

In [333]:
# for cpt in changepoints:
#     vib_fig.add_vline(x=x.index[cpt], line_color="red", line_width=1)
# vib_fig.show()
add_changepoint_vlines(vib_fig, x.index[changepoints])

In [334]:
segment_labels = change_detector.transform(x)
segments = (
    segment_labels.to_frame()
    .groupby("segment_label")
    .apply(lambda x: pd.Interval(x.index[0], x.index[-1] + pd.Timedelta("10min")), include_groups=False)
)
add_segmentation_vrects(vib_fig, segments)

In [335]:
df_air_handling

| time                      | vibration_magnitude | standardized_vibration_magnitude |
|---------------------------|---------------------|----------------------------------|
| 2023-03-18 06:00:00+00:00 | 0.469595            | 4.016322                         |
| 2023-03-18 06:10:00+00:00 | 0.465891            | 3.970867                         |
| 2023-03-18 06:20:00+00:00 | 0.423907            | 3.455684                         |
| 2023-03-18 06:30:00+00:00 | 0.419382            | 3.400155                         |
| 2023-03-18 06:40:00+00:00 | 0.416293            | 3.362252                         |
| ...                       | ...                 | ...                              |
| 2023-04-17 05:20:00+00:00 | 0.444172            | 3.704351                         |
| 2023-04-17 05:30:00+00:00 | 0.438997            | 3.640849                         |
| 2023-04-17 05:40:00+00:00 | 0.435317            | 3.595691                         |
| 2023-04-17 05:50:00+00:00 | 0.433883            | 3.578093                         |


In [336]:
df_air_handling["segment_label"] = segment_labels
segment_means = df_air_handling.groupby("segment_label")["vibration_magnitude"].mean()
df_air_handling = df_air_handling.merge(
    segment_means, on="segment_label", suffixes=("", "_mean")
)
df_air_handling

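|      | vibration_magnitude | standardized_vibration_magnitude | segment_label | vibration_magnitude_mean |
|------|---------------------|----------------------------------|---------------|--------------------------|
| 0    | 0.469595            | 4.016322                         | 0             | 0.421019                 |
| 1    | 0.465891            | 3.970867                         | 0             | 0.421019                 |
| 2    | 0.423907            | 3.455684                         | 0             | 0.421019                 |
| 3    | 0.419382            | 3.400155                         | 0             | 0.421019                 |
| 4    | 0.416293            | 3.362252                         | 0             | 0.421019                 |
| ...  | ...                 | ...                              | ...           | ...                      |
| 4306 | 0.444172            | 3.704351                         | 119           | 0.442113                 |
| 4307 | 0.438997            | 3.640849                         | 119           | 0.442113                 |
| 4308 | 0.435317            | 3.595691                         | 119           | 0.442113                 |
| 4309 | 0.433883            | 3.578093                         | 119           | 0.442113                 |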


In [344]:
px.line(df_air_handling, y=["vibration_magnitude", "vibration_magnitude_mean"]).update_layout(
    width=vib_fig_width, height=vib_fig_height
)

In [338]:
px.histogram(segment_means, nbins=500).update_layout(width=base_width, height=base_height)

In [339]:
px.histogram(df_air_handling["vibration_magnitude"], nbins=500).update_layout(
    width=base_width, height=base_height
)

There are clearly three states in the data, visible in both the raw data and the segment means.

A good use case for segmentation: 

* Change point detection -> aggregation -> clustering -> segmentation.

The change point -> aggregation step reduces the data from roughly 4,300 samples to 120 segment means (one per segment between the 119 detected changepoints) before clustering.
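
The aggregation -> clustering -> segmentation part of this pipeline can be sketched on toy data as follows. The segment labels are taken as given, as if produced by a change detector. The tiny 1-D k-means in `kmeans_1d` is a hypothetical helper written for this sketch; a library clustering routine would normally be used instead.

```python
import numpy as np
import pandas as pd


def kmeans_1d(values, k, n_iter=50):
    """Tiny Lloyd's algorithm for 1-D data, initialised at evenly spaced quantiles."""
    values = np.asarray(values, dtype=float)
    centers = np.quantile(values, np.linspace(0, 1, k))
    for _ in range(n_iter):
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        new_centers = np.array(
            [
                values[labels == j].mean() if np.any(labels == j) else centers[j]
                for j in range(k)
            ]
        )
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Toy series visiting three states (0 = off, 1 = intermediate, 2 = on) repeatedly.
rng = np.random.default_rng(0)
state_levels = np.array([0.0, 0.5, 1.0])
true_states = [0, 2, 1, 2, 0, 1]
segment_length = 50
x = np.concatenate(
    [state_levels[s] + 0.02 * rng.standard_normal(segment_length) for s in true_states]
)
segment_label = np.repeat(np.arange(len(true_states)), segment_length)
df = pd.DataFrame({"x": x, "segment_label": segment_label})

# Aggregation: one mean per detected segment.
segment_means = df.groupby("segment_label")["x"].mean()
# Clustering: group the segment means into three states.
state_of_segment, centers = kmeans_1d(segment_means.to_numpy(), k=3)
# Segmentation: map each time point to the state of its segment.
df["state"] = state_of_segment[df["segment_label"].to_numpy()]
print(df.groupby("segment_label")["state"].first())
```

Clustering the segment means rather than the raw samples is what allows the same state label to be assigned to disconnected segments at low cost.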

# Change in covariance matrix example

TODO: Might get a good real data set for this.

In [349]:
from scipy.stats import multivariate_normal
import scipy.linalg

p = 10
# Toeplitz covariance matrix with entries 0.9 ** |i - j| (strongly correlated variables)
cov_matrix = scipy.linalg.toeplitz(0.9 ** np.arange(p))
cov_matrix.shape

values = np.concatenate(
    (
        multivariate_normal.rvs(np.zeros(p), cov_matrix, 100),
        multivariate_normal.rvs(np.zeros(p), np.eye(p), 100),
    )
)
df = pd.DataFrame(values)
df["label"] = np.concatenate((np.zeros(100), np.ones(100))).astype(int).astype(str)

x = df.iloc[:, :-1]
plot_multivariate_time_series(x).update_layout(width=1.5*base_width, height=2*base_height).show()

In [341]:
px.scatter(df, x=0, y=1, color="label", width=600, height=600)

In [342]:
from skchange.costs import GaussianCovCost

cost = GaussianCovCost()
change_detector = PELT(cost=cost, penalty_scale=1.0, min_segment_length=20)
change_detector.fit(x)
change_detector.predict(x)

0    100
Name: changepoint, dtype: int64

# Future development

* Generalized tuning of hyperparameters across detectors. Penalties/thresholds in particular.
* Standard preprocessing tools for change and anomaly detection.
* Plenty more costs, change scores and anomaly scores.

# Credits: Detection notebook

notebook creation: tveten, Norsk Regnesentral

detection module design: fkiraly, miraep8, alex-jg3, lovkush-a, aiwalter, duydl, katiebuc, tveten