# Agenda

1. Detection

    * Different tasks/taxonomy
    * Conceptual model
    * Interface

2. `skchange` algorithm framework

    * Interval scoring
    * Search

3. Interval scoring

    * Costs
    * Change scores
    * Anomaly scores

4. Change detection

5. Segment anomaly detection

# The detection module
Experimental module, still under heavy development.

Some discrepancies between `sktime` and `skchange` are expected.

Contributions appreciated!

# Detection tasks

1. Change detection
2. Segmentation
3. Point anomaly/outlier detection
4. Segment anomaly detection


## Change detection and segmentation

![Change detection and segmentation](img/changepoints_and_segments.png)

* Change detection: Detect points in time where the data generating process changes significantly.
* Segmentation: Divide the data into segments based on certain criteria. The same label can be applied to multiple disconnected segments.

### Use cases

* Data cleaning: Remove segments that are not relevant for the analysis.
* Preprocessing: Divide the data into homogeneous parts for individual analysis.
* Detect interesting patterns: Anomaly detection, motif discovery, state transitions.

## Point and segment anomaly detection

![Point and Segment Anomalies](img/point_vs_segment_anomaly.png)

* Point anomaly detection: Detect individual data points that are significantly different from the rest of the data.
* Segment anomaly detection: Detect segments of data that are significantly different from the rest of the data.

### Use cases

* Data cleaning: Remove anomalies from the data.
* Detect interesting events: Fault detection, fraud detection, etc.

# Detector conceptual model

1. Set hyperparameters
2. Fit the detector to training data
3. Detect events on new data

    * Input: A time series
    * Output: Detected events.

        - Change points
        - Segments
        - Point anomalies
        - Segment anomalies
        - Change intervals
        - ... 

# Detector interface

* `__init__(self, ...)`

    - Set hyperparameters.
* `fit(self, X, y=None)`

    - Fit the detector to the training data.
* `predict(self, X)`

    - Detect events on new data. 
    - Sparse format: One entry per detected event.
* `transform(self, X)`

    - Detect events on new data. 
    - Dense format: One entry per input time point.
    - Default: Run `predict` + sparse to dense conversion.
* `transform_scores(self, X)` [optional]

    - Return detection scores for each time point.
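To make the interface concrete, here is a minimal sketch of a toy detector that follows the same method layout. The class `ThresholdPointAnomalyDetector` and its `n_std` hyperparameter are hypothetical and exist only for illustration; they are not part of `sktime` or `skchange`, whose detectors work on pandas objects rather than plain lists.

```python
class ThresholdPointAnomalyDetector:
    """Hypothetical toy detector, NOT part of sktime or skchange."""

    def __init__(self, n_std=3.0):
        # Hyperparameter: number of standard deviations that counts as anomalous.
        self.n_std = n_std

    def fit(self, X, y=None):
        # Learn the mean and standard deviation of the training data.
        # Assumes X is a non-constant list of floats (so the std is non-zero).
        n = len(X)
        self.mean_ = sum(X) / n
        self.std_ = (sum((x - self.mean_) ** 2 for x in X) / n) ** 0.5
        return self

    def transform_scores(self, X):
        # One score per input time point.
        return [abs(x - self.mean_) / self.std_ for x in X]

    def predict(self, X):
        # Sparse format: one entry (the index) per detected event.
        scores = self.transform_scores(X)
        return [i for i, score in enumerate(scores) if score > self.n_std]

    def transform(self, X):
        # Dense format: run predict, then convert sparse to dense labels.
        flagged = set(self.predict(X))
        return [1 if i in flagged else 0 for i in range(len(X))]


train = [0.0, 1.0, -1.0, 0.5, -0.5]
new = [0.2, 9.0, -0.3, -8.0]

detector = ThresholdPointAnomalyDetector(n_std=3.0).fit(train)
sparse = detector.predict(new)   # indices of the detected anomalies
dense = detector.transform(new)  # one 0/1 label per time point
```

The sparse output lists only the flagged indices, while the dense output has the same length as the input, which is the distinction between `predict` and `transform` described above.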

# `skchange` algorithm framework

* All detectors in `skchange` are ___search methods___ composed of an ___interval scorer___. 

    - Similar pattern to `ruptures`, but more general and performance oriented.
* `sktime` contains algorithms that do not follow this pattern.

### Interval scorer

- An abstraction for unifying evaluation of cost functions and statistical tests for change and anomaly detection.
- Role: Compute scores efficiently for many data cuts.
- __Cuts__: Intervals, intervals with a split point, intervals with an inner interval, ...

### Search method

- Detects events by optimizing the scores from the interval scorer.
- Role: Decide which cuts to evaluate and how to convert the scores into detected events.


# Interval scoring

## Costs

Generate some Gaussian toy data with a single changepoint.

In [1]:
from skchange.datasets import generate_alternating_data

change_point = 50
single_cpt_df = generate_alternating_data(
    n_segments=2,
    segment_length=change_point,
    mean=5,
    random_state=0
)
single_cpt_df

Unnamed: 0,0
0,1.764052
1,0.400157
2,0.978738
3,2.240893
4,1.867558
...,...
95,5.706573
96,5.010500
97,6.785870
98,5.126912


In [2]:
import plotly.express as px

px.line(single_cpt_df)

Costs are evaluated over interval cuts. 

Fit and evaluate the L2 cost for a constant mean model.

In [3]:
from skchange.costs import L2Cost

cost = L2Cost()  # L2 cost function for a constant mean.
cost.fit(single_cpt_df)  # Precomputes sums and sums of squares.

# cuts = [start, end] for costs.
interval_cuts = [[0, 50], [25, 75], [50, 100]]
cost_values = cost.evaluate(interval_cuts)  # Uses precomputed sums + numba to evaluate.
cost_values


array([[ 63.34009223],
       [343.56245089],
       [ 37.59049314]])

The lower the cost, the better the fit. The middle interval has a high cost because a single constant mean fits poorly across the change point at index 50.
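The idea of precomputing sums and sums of squares can be sketched in plain Python. This is an illustration only, assuming the L2 cost of an interval is the sum of squared deviations from the interval mean; the helper names `fit_prefix_sums` and `l2_cost` are made up here, and the actual `skchange` implementation may differ.

```python
from itertools import accumulate


def fit_prefix_sums(x):
    # Cumulative sums with a leading zero, so that
    # sum(x[start:end]) == sums[end] - sums[start].
    sums = [0.0] + list(accumulate(x))
    sq_sums = [0.0] + list(accumulate(v * v for v in x))
    return sums, sq_sums


def l2_cost(sums, sq_sums, start, end):
    # Sum of squared deviations from the mean of x[start:end],
    # computed in O(1) from the precomputed prefix sums.
    n = end - start
    s = sums[end] - sums[start]
    ss = sq_sums[end] - sq_sums[start]
    return ss - s * s / n


# Tiny toy series with a mean shift at index 4.
x = [0.0, 1.0, 0.0, 1.0, 5.0, 6.0, 5.0, 6.0]
sums, sq_sums = fit_prefix_sums(x)

interval_cuts = [(0, 4), (2, 6), (4, 8)]
costs = [l2_cost(sums, sq_sums, start, end) for start, end in interval_cuts]
# costs == [1.0, 26.0, 1.0]: the interval straddling the change is the costly one.
```

After the one-off fit, each interval costs a constant number of operations regardless of its length, which is what makes evaluating many cuts cheap.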

## Change scores

Change scores are statistical tests that quantify the evidence for a single change in the data.

They are evaluated over `(start, split, end)` cuts, where the data subsets `X[start:split]` and `X[split:end]` are compared.

All costs can be used to construct a change score:
```
score.evaluate([start, split, end]) = cost.evaluate([start, end]) - cost.evaluate([start, split]) - cost.evaluate([split, end])
```

Fit and evaluate the change score constructed from the L2 cost.

In [4]:
from skchange.change_scores import ChangeScore
from skchange.costs import L2Cost

change_score = ChangeScore(cost=L2Cost())
change_score.fit(single_cpt_df)
interval_split_cuts = [[0, 25, 50], [25, 50, 75], [50, 75, 100]]
change_score_values = change_score.evaluate(interval_split_cuts)
change_score_values

array([[  5.59834604],
       [301.44887152],
       [  3.44050646]])

The higher the score, the more evidence for a change. 

High score for the second cut [25, 50, 75] because of the true change point at index 50.
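The construction of a change score from a cost can be written out directly. The sketch below is a plain-Python illustration under the assumption that the L2 cost is the sum of squared deviations from the interval mean; `l2_cost` and `change_score` are hypothetical helpers, not `skchange` functions.

```python
def l2_cost(x, start, end):
    # Sum of squared deviations from the mean of x[start:end].
    segment = x[start:end]
    mean = sum(segment) / len(segment)
    return sum((v - mean) ** 2 for v in segment)


def change_score(x, start, split, end):
    # Cost of one model for the whole interval, minus the cost of
    # separate models before and after the split.
    return (
        l2_cost(x, start, end)
        - l2_cost(x, start, split)
        - l2_cost(x, split, end)
    )


# Tiny toy series with a mean shift at index 4.
x = [0.0, 1.0, 0.0, 1.0, 5.0, 6.0, 5.0, 6.0]
interval_split_cuts = [(0, 2, 4), (2, 4, 6), (4, 6, 8)]
scores = [change_score(x, *cut) for cut in interval_split_cuts]
# scores == [0.0, 25.0, 0.0]: only the cut split at the true change scores high.
```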

`skchange` also supports change scores that cannot be formulated in terms of costs, as well as scores that are more efficient to compute directly.

In [5]:
from skchange.change_scores import CUSUM

# The CUSUM is the most famous change point test.
# It is a direct computation of the square root of the change score based on the L2 cost.
change_score = CUSUM()
change_score.fit(single_cpt_df)
change_score.evaluate(interval_split_cuts)**2

array([[  5.59834604],
       [301.44887152],
       [  3.44050646]])
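The match between the two outputs can be checked by hand on a small example. The sketch below assumes the CUSUM takes the scaled mean-difference form `sqrt(n1 * n2 / (n1 + n2)) * |mean1 - mean2|`, which is consistent with the squared values above; the helper functions are hypothetical and independent of `skchange`.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def l2_cost(x, start, end):
    # Sum of squared deviations from the mean of x[start:end].
    m = mean(x[start:end])
    return sum((v - m) ** 2 for v in x[start:end])


def cusum(x, start, split, end):
    # Scaled absolute difference between the means before and after the split.
    n_before = split - start
    n_after = end - split
    diff = mean(x[start:split]) - mean(x[split:end])
    return sqrt(n_before * n_after / (n_before + n_after)) * abs(diff)


x = [0.0, 1.0, 0.0, 1.0, 5.0, 6.0, 5.0, 6.0]
start, split, end = 2, 4, 6

cost_based = l2_cost(x, start, end) - l2_cost(x, start, split) - l2_cost(x, split, end)
cusum_squared = cusum(x, start, split, end) ** 2
# Both equal 25.0 on this cut.
```

Computing the score from the two segment means avoids three separate cost evaluations, which is one way a direct formulation can be cheaper.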


# Change detection

Let us now detect the change point in the toy data.

In [6]:
from skchange.change_detectors import PELT
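As a library-independent illustration of what a search method does with an interval scorer, the sketch below finds at most one change point by maximising an L2-based change score over all admissible splits. This is deliberately much simpler than PELT and is not how `skchange` implements it; `detect_single_change`, `min_segment_length`, and `threshold` are hypothetical names chosen for this example.

```python
def l2_cost(x, start, end):
    # Sum of squared deviations from the mean of x[start:end].
    segment = x[start:end]
    mean = sum(segment) / len(segment)
    return sum((v - mean) ** 2 for v in segment)


def change_score(x, start, split, end):
    return (
        l2_cost(x, start, end)
        - l2_cost(x, start, split)
        - l2_cost(x, split, end)
    )


def detect_single_change(x, min_segment_length=1, threshold=0.0):
    # Search step: evaluate every admissible split of the full series and
    # keep the best one, provided its score exceeds the threshold.
    n = len(x)
    best_split, best_score = None, threshold
    for split in range(min_segment_length, n - min_segment_length + 1):
        score = change_score(x, 0, split, n)
        if score > best_score:
            best_split, best_score = split, score
    return best_split, best_score


x = [0.0, 1.0, 0.0, 1.0, 5.0, 6.0, 5.0, 6.0]
detected, score = detect_single_change(x, threshold=10.0)
# detected == 4, the index where the mean shifts.

no_change, _ = detect_single_change(x, threshold=100.0)
# no_change is None: no split scores above this stricter threshold.
```

The scorer and the search are kept as separate functions, so a different score could be swapped in without touching the search loop.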

# Segment anomaly detection

## Anomaly scores

Two types of anomaly scores are currently supported:

* __Saving__: The cost of an interval under a fixed baseline parameter minus its cost under the optimal (estimated) parameter.

    - A global anomaly score. 
    - Assumes the fixed parameter is estimated robustly over the entire time series.
* __Local anomaly score__: The difference in cost between an inner interval (the anomaly) and an outer interval.

We only cover the saving here.

### Fixed cost

* Optimised/estimated parameter: `BaseCost(param=None)`
* Fixed parameter: `BaseCost(param: float|np.ndarray)`

In [7]:
from skchange.costs import L2Cost

baseline_cost = L2Cost(param=0)  # fixed mean = 0
baseline_cost.fit(single_cpt_df)
baseline_cost.evaluate(interval_cuts)

array([[  64.32793769],
       [ 599.24590043],
       [1277.14080349]])

In [8]:
from skchange.anomaly_scores import Saving

saving = Saving(baseline_cost=baseline_cost)
saving.fit(single_cpt_df)
saving.evaluate(interval_cuts)

array([[9.87845452e-01],
       [2.55683450e+02],
       [1.23955031e+03]])
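The saving values above equal the fixed-mean costs minus the optimal-mean costs for the same intervals (for example, 64.32793769 - 63.34009223 ≈ 0.98784546). A minimal plain-Python sketch of that relationship, assuming an L2 cost and using hypothetical helper names that are not part of `skchange`:

```python
def l2_cost(x, start, end, mean=None):
    # Sum of squared deviations of x[start:end] from `mean`.
    # If `mean` is None, the optimal (sample) mean of the interval is used.
    segment = x[start:end]
    if mean is None:
        mean = sum(segment) / len(segment)
    return sum((v - mean) ** 2 for v in segment)


def saving(x, start, end, baseline_mean):
    # Cost under the fixed baseline minus cost under the optimal mean.
    return l2_cost(x, start, end, mean=baseline_mean) - l2_cost(x, start, end)


# Toy series: baseline level 0, then a shifted segment from index 4.
x = [0.0, 1.0, 0.0, 1.0, 5.0, 6.0, 5.0, 6.0]
interval_cuts = [(0, 4), (2, 6), (4, 8)]
savings = [saving(x, start, end, baseline_mean=0.0) for start, end in interval_cuts]
# savings == [1.0, 36.0, 121.0]: the further the interval mean is from
# the baseline, the larger the saving.
```

For the L2 cost the saving reduces to the interval length times the squared distance between the interval mean and the baseline, so it grows as more of the interval lies in the shifted region.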