### Agenda

1. Detection tasks
2. The Detector

    * Conceptual model
    * Interface
3. `skchange` algorithm framework

    * Interval scoring
    * Search

4. Interval scoring

    * Costs
    * Change scores
    * Anomaly scores

5. Change detection

6. Segment anomaly detection

7. Air handling unit data example

In [627]:
# Import packages used through the notebook
import numpy as np
import pandas as pd
import plotly.express as px

from utils import (
    plot_changepoint_illustration,
    plot_segmentation_illustration,
    plot_point_anomaly_illustration,
    plot_segment_anomaly_illustration,
    plot_multivariate_time_series,
    add_changepoint_vlines,
    add_segmentation_vrects,
    add_subset_segment_anomaly_vrects,
)

base_width = 800
base_height = 400

# The detection module
<!-- <img src="img/annotation_tree.png" width="800"> -->

Experimental module, still under heavy development.

Some discrepancies between `sktime` and `skchange` are still expected for some time.

Contributions appreciated!

# Detection tasks

1. Change detection
2. Segmentation
3. Point anomaly/outlier detection
4. Segment anomaly detection (special case of segmentation)


## Change detection

In [628]:
from skchange.datasets.generate import generate_changing_data

n = 310

cpts = [50, 150]
means = [0.0, -3, 0.0]
df = generate_changing_data(n, cpts, means, random_state=3)
plot_changepoint_illustration(df, cpts, base_width, base_height)

Change detection: Detect points in time where the data generating process changes significantly.

## Segmentation

In [629]:
extended_cpts = [0] + cpts + [n]
segments = pd.Series(
    [
        pd.Interval(extended_cpts[i], extended_cpts[i + 1], closed="left")
        for i in range(len(extended_cpts) - 1)
    ]
)
segment_labels = pd.Series([0, 1, 0])
plot_segmentation_illustration(df, segments, segment_labels, base_width, base_height)

Segmentation: Divide the data into segments based on certain criteria. The same label can be applied at multiple disconnected segments.

Closely related to change detection, but extra information can be present in the labels.

### Use cases of change detection and segmentation

* Data cleaning: Remove segments that are not relevant for the analysis.
* Preprocessing: Divide the data into homogenous parts for individual analysis.
* Detect interesting patterns: Anomaly detection, motif discovery, state transitions.

## Point anomaly detection

In [630]:
df = generate_changing_data(n)
outliers = [60, 238, 290]
df.iloc[outliers] = np.random.uniform(4, 8, (len(outliers), df.shape[1]))
plot_point_anomaly_illustration(df, outliers, base_width, base_height)

Point anomaly detection: Detect individual data points that are significantly different from the rest of the data.

Will not be covered in this tutorial. Algorithms available in `sktime`.

## Segment anomaly detection

In [631]:
from skchange.datasets.generate import generate_anomalous_data

anomalies = [
    (80, 100),
    (200, 300),
]
means = [5.0, 2.0]
df = generate_anomalous_data(n, anomalies, means, random_state=8)
plot_segment_anomaly_illustration(df, anomalies, base_width, base_height)

Segment anomaly detection: Detect segments of data that are significantly different from the rest of the data.

Can be viewed as a special case of segmentation and change detection:

* Segment anomaly = A change away from the baseline data behaviour + a change back again.

### Use cases of anomaly detection

* Data cleaning: Remove anomalies from the data.
* Detect interesting events: Fault detection, fraud detection, environmental monitoring, health monitoring etc.

# The Detector

## Conceptual model
What does all the problems above have in common?

1. Input: A time series.
2. Output: 
    * Locations of events in the time series
    * Optionally: Additional information per event (like labels).

Length(output) = Number of detected events.

Content and meaning of each detected event differ between the problems:


## Interface

* To illustrate: Use `MovingWindow` as a dummy detector.
* We will go more in depth later.

### Initialize
Set the hyperparameters.

In [632]:
from skchange.change_detectors import MovingWindow
from skchange.change_scores import CUSUM

detector = MovingWindow(
    change_score=CUSUM(),
    bandwidth=20,
    threshold_scale=1.0,
)
detector

### Fit
Fit the detector to training data.

In [680]:
df

Unnamed: 0,0
0,0.091205
1,1.091283
2,-1.946970
3,-1.386350
4,-2.296492
...,...
305,-1.476566
306,-1.808385
307,-0.189663
308,-0.592324


In [683]:
detector = detector.fit(df)
detector

Supervised detectors are also supported.

The general signature is:
`fit(self, X, y=None)`

### Predict

* Detect events on new data.
* Sparse format: One entry per detected event.

In [677]:
detections = detector.predict(df)
detections

0     80
1    100
2    200
Name: changepoint, dtype: int64

In [678]:
pred_output_fig = plot_multivariate_time_series(df)
add_changepoint_vlines(pred_output_fig, detections)

### Transform
* Detect events on new data.
* Dense format: Labels each row of the input time series.

In [692]:
labels = detector.transform(df)
labels

0      0
1      0
2      0
3      0
4      0
      ..
305    3
306    3
307    3
308    3
309    3
Name: segment_label, Length: 310, dtype: int64

In [693]:
plot_df = df.copy()
plot_df["label"] = labels
px.line(plot_df, color="label")

### Transform scores (optional)
Return detection scores for each row of the input time series.

In [687]:
detection_scores = detector.transform_scores(df)
detection_scores

0      0.0
1      0.0
2      0.0
3      0.0
4      0.0
      ... 
305    0.0
306    0.0
307    0.0
308    0.0
309    0.0
Name: score, Length: 310, dtype: float64

In [688]:
px.line(detection_scores)

# `skchange` algorithm framework

* All detectors in `skchange` are ___search methods___ composed of an ___interval scorer___. 

    - Similar pattern as `ruptures`, but more general and performance oriented.
* `sktime` contains algorithms that does not follow this pattern.

### Interval scorer

- An abstraction for unifying evaluation of cost functions and statistical tests for change and anomaly detection.
- Role: Compute scores efficiently for many data cuts.
- __Cuts__: Intervals, intervals with a split point, intervals with an inner interval, ...

### Search method

- Detects events by optimizing the scores from the interval scorer.
- Role: Which cuts should be evaluated and how to convert the scores to detect events.


# Interval scoring

## Costs

Generate some Gaussian toy data with a single changepoint.

In [637]:
from skchange.datasets import generate_alternating_data

change_point = 50
single_cpt_df = generate_alternating_data(
    n_segments=2, segment_length=change_point, mean=5, random_state=0
)
single_cpt_df

Unnamed: 0,0
0,1.764052
1,0.400157
2,0.978738
3,2.240893
4,1.867558
...,...
95,5.706573
96,5.010500
97,6.785870
98,5.126912


In [638]:
px.line(single_cpt_df).update_layout(
    width=base_width, height=base_height, showlegend=False
)

Costs are evaluated over interval cuts. 

Fit and evaluate the L2 cost for a constant mean model.

In [639]:
from skchange.costs import L2Cost

cost = L2Cost()  # L2 cost function for a constant mean.
cost.fit(single_cpt_df)  # Precomputes sums and sums of squares.

# cuts = [start, end] for costs.
interval_cuts = [[0, 50], [25, 75], [50, 100]]
cost_values = cost.evaluate(interval_cuts)  # Uses precomputed sums + numba to evaluate.
cost_values

array([[ 63.34009223],
       [343.56245089],
       [ 37.59049314]])

The lower the cost, the better the fit. High cost in the middle interval because a constant mean fits poorly due to the change point at index 50.

Currently three costs available:

* `L2Cost`: L2 cost for a constant mean model.
* `GaussianVarCost`: Log-likelihood cost for a univariate Gaussian model.
* `GaussianCovCost`: Log-likelihood cost for a multivariate Gaussian model.

Plenty more to come in the future.

## Change scores

Change scores are statistical tests that quantify the evidence for a single change in the data.

They are evaluated over `(start, split, end)` cuts, where the data subsets `X[start:split]` and `X[split:end]` are compared.

All costs can be used to construct a change score:
```
score.evaluate([start, split, end]) = cost.evaluate([start, end]) - cost.evaluate([start, split]) - cost.evaluate([split, end])
```

Fit and evaluate the change score constructed from the L2 cost.

In [640]:
from skchange.change_scores import ChangeScore
from skchange.costs import L2Cost

change_score = ChangeScore(cost=L2Cost())
change_score.fit(single_cpt_df)
interval_split_cuts = [[0, 25, 50], [25, 50, 75], [50, 75, 100]]
change_score_values = change_score.evaluate(interval_split_cuts)
change_score_values

array([[  5.59834604],
       [301.44887152],
       [  3.44050646]])

The higher the score, the more evidence for a change. 

High score for the second cut [25, 50, 75] because of the true change point at index 50.

Also support for change scores that cannot be formulated in terms of costs, or cases where computational efficiency can be gained from calculating the change scores directly.

In [641]:
from skchange.change_scores import CUSUM

# The CUSUM is the most famous change point test.
# It is a direct compution of the square root of the change score based on the L2 cost.
change_score = CUSUM()
change_score.fit(single_cpt_df)
change_score.evaluate(interval_split_cuts) ** 2

array([[  5.59834604],
       [301.44887152],
       [  3.44050646]])


# Change detection

Let's generate a slightly more complicated data set to detect change points in.

In [642]:
from skchange.datasets.generate import generate_anomalous_data

# Generate data
n = 300
anomalies = [
    (100, 120),
    (250, 300),
]
means = [
    np.array([8.0, 0.0, 0.0]),
    np.array([2.0, 3.0, 5.0]),
]
df_3d_2anomalies = generate_anomalous_data(
    n, anomalies=anomalies, means=means, random_state=3
)
# df.index = pd.date_range(start="2024-11-01", periods=n, freq="h")jA
df_3d_2anomalies

Unnamed: 0,0,1,2
0,1.788628,0.436510,0.096497
1,-1.863493,-0.277388,-0.354759
2,-0.082741,-0.627001,-0.043818
3,-0.477218,-1.313865,0.884622
4,0.881318,1.709573,0.050034
...,...,...,...
295,1.181978,3.571703,6.375051
296,2.403536,2.262568,4.251524
297,3.365367,2.201864,5.089317
298,1.741105,3.053021,6.096369


In [643]:
fig_3d = plot_multivariate_time_series(df_3d_2anomalies).update_layout(
    width=base_width, height=base_height
)

fig_3d.show()

Use the `MovingWindow` search method to detect change points.

This is the fastest and conceptually simplest search method available in `skchange`, but it often does the job.

1. Set a `bandwidth` and `threshold_scale` parameter.
2. Evaluate the change score by moving a window of `2*bandwidth` data points with a split in the middle along the time series.

In [644]:
from skchange.change_detectors import MovingWindow

change_detector = MovingWindow(
    change_score=CUSUM(),
    bandwidth=15,  # The number of samples on each side of a split point.
    # Scaling factor for the threshold. The threshold is set to
    # `threshold_scale * default_threshold`, where the default threshold depends on
    # the number of samples, the number of variables and `bandwidth`.
    threshold_scale=1.0,  # How much to scale the default threshold.
)
change_detector.fit(df_3d_2anomalies)

Due to its simplicity, it has a neat and intuitive visual representation.

In [645]:
window_scores = change_detector.transform_scores(df_3d_2anomalies)
px.line(window_scores).update_layout(
    yaxis_title="CUSUM change score",
    width=base_width,
    height=base_height,
    showlegend=False,
)

The peaks in change scores correspond to the changepoints detected by the moving window algorithm.

In [646]:
changepoints = change_detector.predict(df_3d_2anomalies)
fig_3d_with_changepoints = add_changepoint_vlines(fig_3d, changepoints)

fig_3d_with_changepoints.show()
changepoints

0    100
1    120
2    250
Name: changepoint, dtype: int64

Currently, you get the segment labels from `transform` of change detectors in `skchange`. In `sktime`, currently, you get an indicator for the change point locations.

In [647]:
segment_labels = change_detector.transform(df_3d_2anomalies)
segment_labels

0      0
1      0
2      0
3      0
4      0
      ..
295    3
296    3
297    3
298    3
299    3
Name: segment_label, Length: 300, dtype: int64

In [648]:
segments = (
    segment_labels.to_frame()
    .groupby("segment_label")
    .apply(lambda x: pd.Interval(x.index[0], x.index[-1] + 1), include_groups=False)
)
add_segmentation_vrects(fig_3d, segments)

Note: The `MovingWindow` search method may fail for more complicated change point settings.

Observe what happens if the `bandwidth` is set too high compared to the length of a segment.

In [649]:
change_detector = MovingWindow(
    CUSUM(),
    bandwidth=30,
    threshold_scale=1.0,
)
change_detector.fit(df_3d_2anomalies)
window_scores_bandwidth40 = change_detector.transform_scores(df_3d_2anomalies)
px.line(window_scores_bandwidth40).update_layout(
    yaxis_title="CUSUM change score",
    width=base_width,
    height=base_height,
    showlegend=False,
).show()
change_detector.predict(df_3d_2anomalies)

0     91
1    129
2    250
Name: changepoint, dtype: int64

* The `bandwidth` needs to be smaller than the shortest segment to work properly!
* But the smaller the `bandwidth`, the lower the detection power of the statistical test.
* Other algorithms in `skchange` are more robust to this issue, at a slightly higher computational cost. Other options:

  - `SeededBinarySegmentation`. Based on change scores. Approximate cost optimization.
  - `PELT`. Based on costs. Exact cost optimization.

# Segment anomaly detection

But first we need more kinds of interval scorers: Anomaly scores.

## Anomaly scores

Two supported types of anomaly scores at the moment:

* __Saving__: The difference in cost between a fixed baseline parameter and an optimal parameter.

    - The fixed parameter represents the baseline data behaviour.
    - A robust estimator should be used to estimate it in advance of the detection task.
    - A global anomaly score. 
* __Local anomaly score__: The difference in cost between an inner interval (the anomaly) and the commplement of an outer interval.

We only cover the saving now.

Recall the single changepoint data set.

In [650]:
px.line(single_cpt_df).update_layout(
    width=base_width, height=base_height, showlegend=False
)

### Fixed cost
All costs in `skchange` can be configured to evaluate for both a **fixed** and **optimal** parameter.
```
BaseCost(param: float|np.ndarray|None=None),
```

where

* Optimal: `param=None`
* Fixed: `param: float|np.ndarray`

In [651]:
from skchange.costs import L2Cost

baseline_cost = L2Cost(param=0)  # fixed mean = 0
baseline_cost.fit(single_cpt_df)
baseline_cost.evaluate(interval_cuts)

array([[  64.32793769],
       [ 599.24590043],
       [1277.14080349]])

* mean=0 is a good fit for [0, 50)
* mean=0 is a poor fit for [50, 100)

As for change scores, all costs can be used to construct a saving:
```
fixed_cost = MyCost(param=param)
optim_cost = fixed_cost.clone().set_params(param=None)
saving.evaluate([start, end]) = fixed_cost.evaluate([start, end]) - optim_cost.evaluate([start, end])
```

In [652]:
from skchange.anomaly_scores import Saving

saving = Saving(baseline_cost=baseline_cost)
saving.fit(single_cpt_df)
saving.evaluate(interval_cuts)

array([[9.87845452e-01],
       [2.55683450e+02],
       [1.23955031e+03]])

* mean=0 provides little cost saving for [0, 50)
* mean=0 provides much cost saving for [50, 100)

## Back to segment anomaly detection

Recall the 3-dimensional data set with two segment anomalies.

In [653]:
fig_3d = plot_multivariate_time_series(df_3d_2anomalies).update_layout(
    width=base_width, height=base_height
)

fig_3d.show()

Use the `CAPA` (Collective and Point Anomalies) algorithm to detect segment (and point) anomalies.

1. Set 

    * `collective_saving` and `collective_penalty_scale`,
    * `point_saving` and `point_penalty_scale`,
    * `min_segment_length` and `max_segment_length`, and
    * `ignore_point_anomalies = [True|False]`.

2. Optimise the saving by a recursive dynamic programming algorithm.

In [654]:
from skchange.anomaly_detectors import CAPA

anomaly_detector = CAPA(
    saving,
    collective_penalty_scale=1.0,
)
anomalies = anomaly_detector.fit_predict(df_3d_2anomalies)
anomaly_labels = anomaly_detector.transform(df_3d_2anomalies)
print(anomalies)
print("============================================")
print(anomaly_labels)

0    [100, 120)
1    [250, 300)
Name: anomaly_interval, dtype: interval
0      0
1      0
2      0
3      0
4      0
      ..
295    2
296    2
297    2
298    2
299    2
Name: anomaly_label, Length: 300, dtype: int64


In [655]:
add_segmentation_vrects(fig_3d, anomalies, colors=["red"])

The scores are more complicated than for the `MovingWindow` change detector.

The scores are the *cumulative* optimal savings.

They increase when the saving is larger than the penalty.

In [656]:
px.line(anomaly_detector.transform_scores(df_3d_2anomalies)).update_layout(
    yaxis_title="Cumulative optimal saving",
    width=base_width,
    height=base_height,
    showlegend=False,
)

## Subset anomaly detection

Root cause analysis: Detect the subset of the data that causes the anomaly.

`MVCAPA` (Multivariate Collective and Point Anomalies) is the only algorithm in `skchange` with this capability so far.

It is a more complex version of `CAPA` that optimises the saving over all possible subsets of the data.

In [657]:
from skchange.anomaly_detectors import MVCAPA

subset_anomaly_detector = MVCAPA()
subset_anomalies = subset_anomaly_detector.fit_predict(df_3d_2anomalies)
subset_anomaly_labels = subset_anomaly_detector.transform(df_3d_2anomalies)
print(subset_anomalies)
print("============================================")
print(subset_anomaly_labels)

  anomaly_interval anomaly_columns
0       [100, 120)             [0]
1       [250, 300)       [2, 1, 0]
     0  1  2
0    0  0  0
1    0  0  0
2    0  0  0
3    0  0  0
4    0  0  0
..  .. .. ..
295  2  2  2
296  2  2  2
297  2  2  2
298  2  2  2
299  2  2  2

[300 rows x 3 columns]


In [658]:
add_subset_segment_anomaly_vrects(fig_3d, subset_anomalies)

# HVAC system dataset

<img src="img/hvac_system_ventilation.png" alt="img/hvac_system_ventilation.png" width="400"/>

*Heating, ventilation and air conditioning (HVAC) system.*

The dataset:
    
* Two units. We look at one of them.
* Vibration magnitude sensor measurements every 10 minutes.
* 30 days of data.

Data background:
* From the company [Soundsensing](https://www.soundsensing.no/).
* Research project on detecting failing equipment in buildings using vibration and sound sensors.
* Funded by the Research Council of Norway.

In [659]:
from skchange.datasets import load_hvac_system_data

df_hvac = load_hvac_system_data().loc[1]  # only unit 1
df_hvac

Unnamed: 0_level_0,vibration
time,Unnamed: 1_level_1
2023-12-09 04:30:00+00:00,0.004123
2023-12-09 04:40:00+00:00,0.004123
2023-12-09 04:50:00+00:00,0.004123
2023-12-09 05:00:00+00:00,0.004123
2023-12-09 05:10:00+00:00,0.004123
...,...
2024-01-08 03:50:00+00:00,0.004123
2024-01-08 04:00:00+00:00,0.004123
2024-01-08 04:10:00+00:00,0.004123
2024-01-08 04:20:00+00:00,0.004123


In [660]:
true_anomaly = pd.Interval(
    pd.Timestamp("2024-01-03 06:00").tz_localize("UTC"),
    pd.Timestamp("2024-01-05 17:00").tz_localize("UTC"),
)

test_start = pd.Timestamp("2024-01-01").tz_localize("UTC")

df_hvac_train = df_hvac.loc[: test_start - pd.Timedelta(seconds=1)]
df_hvac_test = df_hvac.loc[test_start:]

In [661]:
px.line(df_hvac).update_layout(
    yaxis_title="vibration magnitude", showlegend=False
).add_vrect(
    x0=test_start,
    x1=df_hvac.index[-1],
    fillcolor="rgba(0,0,0,0.2)",
    layer="below",
    line_width=0,
    annotation_text="Test set",
    annotation_position="top left",
).add_vrect(
    x0=df_hvac.index[0],
    x1=test_start,
    fillcolor="rgba(0,0,0, 0.05)",
    layer="below",
    line_width=0,
    annotation_text="Train set",
    annotation_position="top left",
).add_vrect(
    x0=true_anomaly.left,
    x1=true_anomaly.right,
    fillcolor="rgba(255,0,0,0.3)",
    line_width=0,
    annotation_text="True anomaly",
    annotation_position="top left",
)

This particular machine has two states:

1. Off: Vibration close to 0
2. On: Vibration more than approximately 0

Task: On each weekday, when does the machine normally turn on and off?

* Used to detect deviations from its regular schedule.
* Useful information to the maintenance staff.

Note: 
* A simple thresholding algorithm could solve the problem in this example.
* Not all cases are as clear-cut!
* Only for demonstration purposes.

### Step 1: Estimate the change points

In [662]:
from skchange.change_detectors import PELT
from skchange.costs import L2Cost

# Standardize the data. Discussed at the end of the example.
std = df_hvac_train.std().iloc[0]
x_train = df_hvac_train / std
x_test = df_hvac_test / std

cost = L2Cost()
change_detector = PELT(cost)
change_detector.fit(x_train)
changepoints = change_detector.predict(x_train)
changepoints

0      296
1      365
2      440
3      509
4      584
5      653
6      728
7      797
8      872
9      941
10    1304
11    1373
12    1448
13    1517
14    1592
15    1661
16    1736
17    1804
18    1879
19    1948
20    2311
21    2380
22    2455
23    2524
24    2599
25    2668
26    2743
27    2812
28    2887
29    2956
Name: changepoint, dtype: int64

In [663]:
vib_fig_train = px.line(df_hvac_train).update_layout(
    yaxis_title="vibration magnitude", showlegend=False
)
add_changepoint_vlines(vib_fig_train, x_train.index[changepoints])

There's also a big difference in variance between the off and on states, so could also use a Gaussian cost.

In [664]:
from skchange.costs import GaussianVarCost

var_cost = GaussianVarCost()  # Only line that needs to change
var_change_detector = PELT(var_cost)
var_change_detector.fit(x_train)
var_changepoints = change_detector.predict(x_train)
var_changepoints

0      296
1      365
2      440
3      509
4      584
5      653
6      728
7      797
8      872
9      941
10    1304
11    1373
12    1448
13    1517
14    1592
15    1661
16    1736
17    1804
18    1879
19    1948
20    2311
21    2380
22    2455
23    2524
24    2599
25    2668
26    2743
27    2812
28    2887
29    2956
Name: changepoint, dtype: int64

In [665]:
add_changepoint_vlines(vib_fig_train, x_train.index[var_changepoints])

It gives the same result.

We continue with the L2 cost changepoints to detect the on/off times.

### Step 2: Convert the changepoints to on-segments

* Can use the `StatThresholdAnomaliser` in `skchange` for this.
    - Calculate the mean per segment.
    - If the mean in the interval `[stat_lower, stat_upper]` it is a "off-segment".
    - Otherwise, it is an "on-segment".
* Could have also used an anomaly detector like `CAPA` immediately.

In [666]:
from skchange.anomaly_detectors import StatThresholdAnomaliser
from utils import to_time_intervals

change_detector = PELT(L2Cost())
anomaly_detector = StatThresholdAnomaliser(
    change_detector,
    stat=np.mean,
    stat_lower=0.0,
    stat_upper=0.01 / std,  # Since we rescale the data.
)
on_segments = anomaly_detector.fit_predict(x_train)
on_segments = to_time_intervals(on_segments, x_train.index)
on_segments

0     [2023-12-11 05:50:00+00:00, 2023-12-11 17:20:0...
1     [2023-12-12 05:50:00+00:00, 2023-12-12 17:20:0...
2     [2023-12-13 05:50:00+00:00, 2023-12-13 17:20:0...
3     [2023-12-14 05:50:00+00:00, 2023-12-14 17:20:0...
4     [2023-12-15 05:50:00+00:00, 2023-12-15 17:20:0...
5     [2023-12-18 05:50:00+00:00, 2023-12-18 17:20:0...
6     [2023-12-19 05:50:00+00:00, 2023-12-19 17:20:0...
7     [2023-12-20 05:50:00+00:00, 2023-12-20 17:20:0...
8     [2023-12-21 05:50:00+00:00, 2023-12-21 17:20:0...
9     [2023-12-22 05:50:00+00:00, 2023-12-22 17:20:0...
10    [2023-12-25 05:50:00+00:00, 2023-12-25 17:20:0...
11    [2023-12-26 05:50:00+00:00, 2023-12-26 17:20:0...
12    [2023-12-27 05:50:00+00:00, 2023-12-27 17:20:0...
13    [2023-12-28 05:50:00+00:00, 2023-12-28 17:20:0...
14    [2023-12-29 05:50:00+00:00, 2023-12-29 17:20:0...
Name: interval, dtype: interval

In [667]:
add_segmentation_vrects(vib_fig_train, on_segments, colors=["green"])

### Step 3: Estimate the weekly schedule

Step 3a: Find the weekday for each on-segment.

In [668]:
on_segments = on_segments.to_frame()
on_segments["wday"] = on_segments.apply(
    lambda x: x["interval"].left.weekday(),
    axis=1,
)
on_segments

Unnamed: 0,interval,wday
0,"[2023-12-11 05:50:00+00:00, 2023-12-11 17:20:0...",0
1,"[2023-12-12 05:50:00+00:00, 2023-12-12 17:20:0...",1
2,"[2023-12-13 05:50:00+00:00, 2023-12-13 17:20:0...",2
3,"[2023-12-14 05:50:00+00:00, 2023-12-14 17:20:0...",3
4,"[2023-12-15 05:50:00+00:00, 2023-12-15 17:20:0...",4
5,"[2023-12-18 05:50:00+00:00, 2023-12-18 17:20:0...",0
6,"[2023-12-19 05:50:00+00:00, 2023-12-19 17:20:0...",1
7,"[2023-12-20 05:50:00+00:00, 2023-12-20 17:20:0...",2
8,"[2023-12-21 05:50:00+00:00, 2023-12-21 17:20:0...",3
9,"[2023-12-22 05:50:00+00:00, 2023-12-22 17:20:0...",4


Step 3b: Estimate the on and off times for each weekday.

In [669]:
def estimate_on_off_times(wday_intervals):
    on_times = wday_intervals.array.left
    off_times = wday_intervals.array.right

    on_time = (on_times - on_times.normalize()).mean()
    off_time = (off_times - off_times.normalize()).mean()

    return pd.Series({"on_time": on_time, "off_time": off_time})


schedule = on_segments.groupby("wday")["interval"].apply(estimate_on_off_times)
schedule = schedule.unstack()
schedule

Unnamed: 0_level_0,on_time,off_time
wday,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0 days 05:50:00,0 days 17:20:00
1,0 days 05:50:00,0 days 17:20:00
2,0 days 05:50:00,0 days 17:20:00
3,0 days 05:50:00,0 days 17:20:00
4,0 days 05:50:00,0 days 17:20:00


### Step 4: Analyse the test data

In [670]:
vib_fig_test = px.line(df_hvac_test).update_layout(
    yaxis_title="vibration magnitude", showlegend=False
)

test_days = x_test.index.normalize().unique()[:-1]
for day in test_days:
    wday = day.weekday()
    if wday not in schedule.index:
        continue
    on_time = schedule.loc[wday, "on_time"]
    off_time = schedule.loc[wday, "off_time"]
    vib_fig_test.add_vrect(
        x0=day + on_time,
        x1=day + off_time,
        fillcolor="rgba(0,0,255,0.2)",
        line_width=0,
        annotation_text="Expect ON",
        annotation_position="top left",
    )
vib_fig_test.show()

The deviation from the expected schedule can easily be spotted.

**!Alarm!**

### Notes on preprocessing

The default penalties in `skchange` assume that the **within-segment** data has unit variance.

There are currently three options for dealing with this:

1. Estimate the within-segment variance and standardize the data.

    * Common method: Estimate the within-segment variance by `factor*X.diff().var()/2`.
    * Works for data without too much auto-correlation.
2. Tune the penalty to the data directly.
3. A combination of the two.

**Top priority for future development**: More robust and automatic methods for tuning the penalty.


Here we used option 3:

* Rescale the data by the standard deviation of the entire training set.
    - Unless the jumps are enormous, this brings the data to a somewhat common scale.
* Tweak the penalty scale around 1 to get the segmentation right.
    - Penalty scale = 1.0 worked well for this data set. Luck.

# Change in covariance matrix example

In [671]:
from scipy.stats import multivariate_normal
import scipy.linalg

p = 10

# Generate a 10x10 covariance matrix
cov_matrix = scipy.linalg.toeplitz(0.9 ** np.arange(p))

# Generate data with two segments (one change point), one with an identity covariance
# matrix and one with the covariance matrix defined above.
values = np.concatenate(
    (
        multivariate_normal.rvs(np.zeros(p), cov_matrix, 100),
        multivariate_normal.rvs(np.zeros(p), np.eye(p), 100),
    )
)
df_cov = pd.DataFrame(values)

# Add a label column
df_cov["label"] = np.concatenate(
    [np.repeat("correlated_segment", 100), np.repeat("independent_segment", 100)]
)

x_train = df_cov.iloc[:, :-1]
plot_multivariate_time_series(x_train).update_layout(
    width=1.5 * base_width, height=2 * base_height
).show()

Not easy to spot the change!

In a scatter plot, however, we can see the change in dependency structure: From a circular to an elliptical shape.

In [672]:
px.scatter(df_cov, x=4, y=5, color="label", width=800, height=600)

In [673]:
from skchange.costs import GaussianCovCost

cost = GaussianCovCost()
change_detector = PELT(cost=cost, penalty_scale=1.0, min_segment_length=30)
change_detector.fit(x_train)
change_detector.predict(x_train)

0    100
Name: changepoint, dtype: int64

# Future developement

1. Generalized tuning of hyperparameters across detectors. Penalties/thresholds in particular.
2. Standard preprocessing tools for change and anomaly detection.
3. Plenty more costs, change scores and anomaly scores.

# Credits: Detection notebook

notebook creation: tveten, Norsk Regnesentral

detection module design: fkiraly, miraep8, alex-jg3, lovkush-a, aiwalter, duydl, katiebuc, johannvk, tveten

# References



* `skchange`: https://github.com/NorskRegnesentral/skchange
* `sktime`: https://www.sktime.net/ 
* `ruptures`:
    - https://centre-borelli.github.io/ruptures-docs/ 
    - C. Truong, L. Oudre, N. Vayatis. Selective review of offline change point\
    detection methods. Signal Processing, 167:107299, 2020.
* `PELT`: Killick, R., Fearnhead, P., & Eckley, I. A. (2012). Optimal detection of \
    changepoints with a linear computational cost. Journal of the American Statistical\
    Association, 107(500), 1590-1598.
* `Seeded binary segmentation`: Kovács, S., Bühlmann, P., Li, H., & Munk, A. (2023).\
    Seeded binary segmentation: a general methodology for fast and optimal changepoint\
    detection. Biometrika, 110(1), 249-256.
* `CAPA` collection: 
    - Fisch, A. T., Eckley, I. A., & Fearnhead, P. (2022). A linear time method\
    for the detection of collective and point anomalies. Statistical Analysis and\
        DataMining: The ASA Data Science Journal, 15(4), 494-508.
    - Fisch, A. T., Eckley, I. A., & Fearnhead, P. (2022). Subset multivariate\
    collective and point anomaly detection. Journal of Computational and Graphical\
    Statistics, 31(2), 574-585.
    - Tveten, M., Eckley, I. A., & Fearnhead, P. (2022). Scalable change-point and\
    anomaly detection in cross-correlated data with an application to condition\
    monitoring. The Annals of Applied Statistics, 16(2), 721-743.