# Feature Extraction

In most time-series machine learning tasks, models are not trained directly on raw signal data. Instead, we perform **feature extraction** to transform segments, or "windows", of the time-series into a set of informative features that better represent the underlying patterns. This process converts each window, a sequence of data points, into a single row of features that a model can learn from.

**IMPORTANT:** All feature extractor classes in this toolkit are designed to operate on **pre-windowed data**. This means you **must** first use the `Windowing` class to prepare your time-series. The workflow is always:

1.  **Windowing Step:** Convert the raw time-series `DataFrame` into a new `DataFrame` where each row represents a single window.
2.  **Feature Extraction Step:** Pass the windowed `DataFrame` to one of the feature extractor classes.

In this notebook, we will demonstrate the three primary feature extraction methods available in the toolkit, following this two-step process.
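
To make the two-step workflow concrete before turning to the toolkit classes, here is a minimal NumPy sketch. The helper `sliding_windows` is a hypothetical name used only for illustration; it is not part of the toolkit and does not reproduce the `Windowing` class, which also applies a window function and names its output columns.

```python
import numpy as np

def sliding_windows(signal, window_size, overlap):
    """Return a 2-D array with one window per row (illustrative helper only)."""
    # The stride is how many samples the window advances between rows.
    step = max(1, int(window_size * (1 - overlap)))
    starts = range(0, len(signal) - window_size + 1, step)
    return np.array([signal[s:s + window_size] for s in starts])

# Step 1: windowing. Each row of `windows` is one segment of the signal.
signal = np.arange(10.0)  # toy signal: 0, 1, ..., 9
windows = sliding_windows(signal, window_size=4, overlap=0.5)
print(windows.shape)  # (4, 4)

# Step 2: feature extraction. Each row is reduced to one feature value.
features = windows.mean(axis=1)
print(features)  # [1.5 3.5 5.5 7.5]
```

The toolkit follows the same pattern, except that each window row holds the samples of every variable side by side and each extractor produces several features per variable.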

In [1]:
import pandas as pd
import torch
import numpy as np
import pywt
import sys
import os
import matplotlib.pyplot as plt
from pathlib import Path
from typing import Optional
from pydantic import BaseModel
from abc import ABC

# Adds the root directory to the sys.path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '../../')))

In [2]:
from ThreeWToolkit.feature_extraction import (
    extract_wavelet_features,
    extract_statistical_features,
    extract_exponential_statistics_features,
)

from ThreeWToolkit.core.base_step import BaseStep
from ThreeWToolkit.core.base_preprocessing import WindowingConfig
from ThreeWToolkit.core.base_feature_extractor import WaveletConfig, StatisticalConfig, EWStatisticalConfig
from ThreeWToolkit.preprocessing._data_processing import Windowing

from ThreeWToolkit.dataset import ParquetDataset
from ThreeWToolkit.core.base_dataset import EventPrefixEnum, ParquetDatasetConfig

## Loading 3W Dataset

In [3]:
dataset_path = Path("./dataset")
event_types = [EventPrefixEnum.REAL]
# Load all files of the selected event types; the target column defaults to 'class'
ds_config = ParquetDatasetConfig(
    path=dataset_path,
    clean_data=True,
    event_type=event_types,
)
ds = ParquetDataset(ds_config)
len(ds)

[ParquetDataset] Dataset found at dataset
[ParquetDataset] Validating dataset integrity...
[ParquetDataset] Dataset integrity check passed!


1119

In [4]:
X_raw = ds[0]['signal']
y_raw = ds[0]['label']

X_raw

timestamp,ABER-CKGL,ABER-CKP,ESTADO-DHSV,ESTADO-M1,ESTADO-M2,ESTADO-PXO,ESTADO-SDV-GL,ESTADO-SDV-P,ESTADO-W1,ESTADO-W2,...,P-JUS-CKGL,P-JUS-CKP,P-MON-CKP,P-PDG,P-TPT,QGL,T-JUS-CKP,T-MON-CKP,T-PDG,T-TPT
2017-09-18 01:01:14,2.653941,0.012941,0.0,0.0,0.0,0.0,0.914069,0.312558,0.650525,0.0,...,0.216116,0.0,-0.710698,0.298795,0.307882,0.032558,-1.713340,0.0,-0.535130,-1.679388
2017-09-18 01:01:15,2.653941,0.012941,0.0,0.0,0.0,0.0,0.914069,0.312558,0.650525,0.0,...,0.216116,0.0,-0.710603,0.298832,0.307838,0.028674,-1.713357,0.0,-0.535132,-1.679398
2017-09-18 01:01:16,2.653941,0.012941,0.0,0.0,0.0,0.0,0.914069,0.312558,0.650525,0.0,...,0.216116,0.0,-0.710507,0.298852,0.307879,0.024790,-1.713374,0.0,-0.535132,-1.679375
2017-09-18 01:01:17,2.653941,0.012942,0.0,0.0,0.0,0.0,0.914069,0.312558,0.650525,0.0,...,0.216116,0.0,-0.710412,0.298873,0.307920,0.020906,-1.713391,0.0,-0.535133,-1.679353
2017-09-18 01:01:18,2.653941,0.012942,0.0,0.0,0.0,0.0,0.914069,0.312558,0.650525,0.0,...,0.216116,0.0,-0.710102,0.298846,0.308028,0.017021,-1.713408,0.0,-0.535135,-1.679298
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2017-09-18 06:59:56,2.653941,-0.580989,0.0,0.0,0.0,0.0,0.914069,0.312558,0.650525,0.0,...,0.195051,0.0,-0.467794,0.300258,0.162198,-0.127575,-2.358256,0.0,-0.535509,-1.470326
2017-09-18 06:59:57,2.653941,-0.580989,0.0,0.0,0.0,0.0,0.914069,0.312558,0.650525,0.0,...,0.195061,0.0,-0.468104,0.300290,0.162376,-0.130274,-2.357908,0.0,-0.535511,-1.470715
2017-09-18 06:59:58,2.653941,-0.580989,0.0,0.0,0.0,0.0,0.914069,0.312558,0.650525,0.0,...,0.195070,0.0,-0.468415,0.300356,0.162515,-0.132972,-2.357560,0.0,-0.535515,-1.471482
2017-09-18 06:59:59,2.653941,-0.580989,0.0,0.0,0.0,0.0,0.914069,0.312558,0.650525,0.0,...,0.195080,0.0,-0.468725,0.300422,0.162653,-0.135672,-2.357213,0.0,-0.535518,-1.472251


## Wavelet Feature Extraction

The `ExtractWaveletFeatures` class uses a signal processing technique called the Stationary Wavelet Transform (SWT). This method decomposes the signal within each window into different frequency components, which can often capture patterns that are invisible to standard statistical measures.

For each level of decomposition, two sets of coefficients are generated:
* **Approximation Coefficients (A):** These capture the low-frequency, underlying trend of the signal. Think of it as a "smoothed" version of the signal within the window.
* **Detail Coefficients (D):** These capture the high-frequency components, representing noise, spikes, and other abrupt changes.

These features allow a model to differentiate between the general shape of a signal and its noisier, high-frequency texture.
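
The split into approximation and detail coefficients can be sketched with a small undecimated Haar transform written in plain NumPy. This is an illustrative assumption, not the toolkit's implementation: `haar_swt` is a hypothetical helper, and the sign and alignment conventions of PyWavelets or `ExtractWaveletFeatures` may differ.

```python
import numpy as np

def haar_swt(window, level):
    """Undecimated Haar transform with periodic boundaries (illustrative sketch)."""
    approx = np.asarray(window, dtype=float)
    coeffs = {}
    for j in range(1, level + 1):
        shift = 2 ** (j - 1)                 # filter spacing doubles at every level
        neighbour = np.roll(approx, -shift)  # sample `shift` steps ahead, wrapping around
        detail = (approx - neighbour) / np.sqrt(2)  # high-frequency part
        approx = (approx + neighbour) / np.sqrt(2)  # low-frequency part
        coeffs[f"A{j}"] = approx
        coeffs[f"D{j}"] = detail
    return coeffs

# A constant window has no high-frequency content: every D is zero and
# each approximation level scales the value by another factor of sqrt(2).
coeffs = haar_swt(np.full(8, 2.0), level=3)
print(coeffs["A3"][-1], coeffs["D3"][-1])

# A step change shows up as non-zero detail coefficients at the finest level.
step_coeffs = haar_swt([0, 0, 0, 0, 1, 1, 1, 1], level=3)
print(step_coeffs["D1"])
```

For a constant window of value `c`, the level-`j` approximation is `c * 2**(j/2)`. The `var1` columns in the extractor output later in this section follow the same scaling: the constant value 2.653941 gives `A1` = 3.75324, `A2` = 5.307883, and `A3` = 7.50648, with detail coefficients that are zero up to floating-point error.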

In [5]:
# Define parameters that will be shared between windowing and feature extraction
# The window_size for the wavelet transform is determined by its level
LEVEL = 3
WINDOW_SIZE = 2**LEVEL  # This will be 8
OVERLAP = 0.875       # This is equivalent to a stride of 1 for a window of 8

# Configure and instantiate the Windowing step
# We use a 'boxcar' window because we don't want to alter the signal
# before applying the wavelet transform.
windowing_config = WindowingConfig(
    window_size=WINDOW_SIZE,
    overlap=OVERLAP,
    window="boxcar" 
)
windowing_step = Windowing(windowing_config)

# Run the windowing step on the raw data
windowed_x = windowing_step(X_raw)

# Align the labels (y)
# The Windowing class doesn't handle labels, so we must align them manually.
# The label for each window corresponds to the label at the end of that window in the original series.
step = max(1, int(WINDOW_SIZE * (1 - OVERLAP)))  # stride in samples, never below 1
end_of_window_indices = [i + WINDOW_SIZE - 1 for i in range(0, len(X_raw) - WINDOW_SIZE + 1, step)]
aligned_y = y_raw.iloc[end_of_window_indices]

# Combine the windowed features and aligned labels into a single DataFrame.
# We reset the index to ensure they concatenate correctly.
windowed_data = pd.concat([
    windowed_x.reset_index(drop=True),
    aligned_y.reset_index(drop=True)
], axis=1)

print("Shape of the pre-windowed data:", windowed_data.shape)
print("The data is now ready for the feature extraction step.")
windowed_data.head()

Shape of the pre-windowed data: (21520, 178)
The data is now ready for the feature extraction step.


,var1_t0,var1_t1,var1_t2,var1_t3,var1_t4,var1_t5,var1_t6,var1_t7,var2_t0,var2_t1,...,var22_t0,var22_t1,var22_t2,var22_t3,var22_t4,var22_t5,var22_t6,var22_t7,win,class
0,2.653941,2.653941,2.653941,2.653941,2.653941,2.653941,2.653941,2.653941,0.012941,0.012941,...,-1.679388,-1.679398,-1.679375,-1.679353,-1.679298,-1.679244,-1.679204,-1.679165,1,3
1,2.653941,2.653941,2.653941,2.653941,2.653941,2.653941,2.653941,2.653941,0.012941,0.012941,...,-1.679398,-1.679375,-1.679353,-1.679298,-1.679244,-1.679204,-1.679165,-1.679125,2,3
2,2.653941,2.653941,2.653941,2.653941,2.653941,2.653941,2.653941,2.653941,0.012941,0.012942,...,-1.679375,-1.679353,-1.679298,-1.679244,-1.679204,-1.679165,-1.679125,-1.679086,3,3
3,2.653941,2.653941,2.653941,2.653941,2.653941,2.653941,2.653941,2.653941,0.012942,0.012942,...,-1.679353,-1.679298,-1.679244,-1.679204,-1.679165,-1.679125,-1.679086,-1.678928,4,3
4,2.653941,2.653941,2.653941,2.653941,2.653941,2.653941,2.653941,2.653941,0.012942,0.012943,...,-1.679298,-1.679244,-1.679204,-1.679165,-1.679125,-1.679086,-1.678928,-1.678771,5,3
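
The relationship between `overlap`, the stride, and the label taken for each window can be checked on a small example. The sketch below uses a hypothetical helper, `window_end_indices`, that mirrors the manual alignment in the cell above; it makes no claim about how the `Windowing` class computes its stride internally.

```python
def window_end_indices(n_samples, window_size, overlap):
    """Index of the last sample of every window (illustrative helper only)."""
    # int() truncates, so the stride is clamped to at least one sample.
    step = max(1, int(window_size * (1 - overlap)))
    return [start + window_size - 1
            for start in range(0, n_samples - window_size + 1, step)]

# An overlap of 0.875 on a window of 8 gives a stride of 1 sample.
ends = window_end_indices(n_samples=12, window_size=8, overlap=0.875)
print(ends)  # [7, 8, 9, 10, 11]

# An overlap of 0.5 gives a stride of 4 samples.
print(window_end_indices(n_samples=12, window_size=8, overlap=0.5))  # [7, 11]
```

Because `int()` truncates, floating-point rounding can shorten the stride for some combinations: `10 * (1 - 0.8)` evaluates to slightly less than 2, so the stride becomes 1. It is worth confirming that the number of aligned labels equals the number of rows returned by the windowing step.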


In [6]:
# Configure the feature extractor
# NOTE: `windowed_data` stores its labels in the 'class' column (see the preview above),
# while `label_column` is set to "label" here.
wavelet_config = WaveletConfig(
    level=LEVEL,
    wavelet="haar",
    label_column="label"
)

# Instantiate the extractor
feature_extractor = extract_wavelet_features.ExtractWaveletFeatures(wavelet_config)

# Manually set the is_windowed flag, which is required for the extractor to run.
feature_extractor.is_windowed = True

# Run the feature extraction step by calling the instance
wavelet_features = feature_extractor(windowed_data)

# Display the results
print("Shape of the final extracted features:", wavelet_features.shape)
print("\nColumns are named 'var<index>_<feature>', and the 'label' column is preserved.")
wavelet_features.head()

Shape of the final extracted features: (21520, 154)

Columns are named 'var<index>_<feature>', and the 'label' column is preserved.


,var1_A3,var1_D3,var1_A2,var1_D2,var1_A1,var1_D1,var1_A0,var2_A3,var2_D3,var2_A2,...,var21_A1,var21_D1,var21_A0,var22_A3,var22_D3,var22_A2,var22_D2,var22_A1,var22_D1,var22_A0
0,7.50648,2.618263e-16,5.307883,7.260208000000001e-17,3.75324,6.961218e-18,2.653941,0.036605,2e-06,0.025885,...,-0.756799,-6.066313e-07,-0.535138,-4.749786,0.000213,-3.358455,8.7e-05,-2.374725,2.8e-05,-1.679165
1,7.50648,2.618263e-16,5.307883,7.260208000000001e-17,3.75324,6.961218e-18,2.653941,0.036606,2e-06,0.025886,...,-0.7568,-3.033157e-07,-0.535138,-4.749693,0.000243,-3.358369,7.9e-05,-2.37467,2.8e-05,-1.679125
2,7.50648,2.618263e-16,5.307883,7.260208000000001e-17,3.75324,6.961218e-18,2.653941,0.036607,2e-06,0.025887,...,-0.756801,-6.066313e-07,-0.535139,-4.749583,0.000244,-3.35829,7.9e-05,-2.374614,2.8e-05,-1.679086
3,7.50648,2.618263e-16,5.307883,7.260208000000001e-17,3.75324,6.961218e-18,2.653941,0.036609,2e-06,0.025888,...,-0.756804,-2.729841e-06,-0.535143,-4.749425,0.000281,-3.358152,0.000138,-2.374474,0.000111,-1.678928
4,7.50648,2.618263e-16,5.307883,7.260208000000001e-17,3.75324,6.961218e-18,2.653941,0.03661,2e-06,0.025888,...,-0.756809,-2.729841e-06,-0.535147,-4.749219,0.000354,-3.357955,0.000256,-2.374252,0.000111,-1.678771


In [7]:
# Configure with an offset, reusing LEVEL and OVERLAP from the cells above
offset = 20
config_offset = WaveletConfig(level=LEVEL, overlap=OVERLAP, offset=offset)
extractor_offset = extract_wavelet_features.ExtractWaveletFeatures(config_offset)
extractor_offset.is_windowed = True
features_offset = extractor_offset(windowed_data)

print(f"\n--- Using offset={offset} ---")

features_offset.head()


--- Using offset=20 ---


,var1_A3,var1_D3,var1_A2,var1_D2,var1_A1,var1_D1,var1_A0,var2_A3,var2_D3,var2_A2,...,var21_A1,var21_D1,var21_A0,var22_A3,var22_D3,var22_A2,var22_D2,var22_A1,var22_D1,var22_A0
0,7.50648,2.618263e-16,5.307883,7.260208000000001e-17,3.75324,6.961218e-18,2.653941,0.036628,2e-06,0.025901,...,-0.756841,-3.033157e-07,-0.535168,-4.748342,-0.000307,-3.357802,-0.000142,-2.374425,-5.4e-05,-1.67901
1,7.50648,2.618263e-16,5.307883,7.260208000000001e-17,3.75324,6.961218e-18,2.653941,0.036629,2e-06,0.025902,...,-0.756842,-1.213263e-06,-0.535169,-4.748511,-0.000348,-3.35795,-0.000158,-2.374541,-6.3e-05,-1.679098
2,7.50648,2.618263e-16,5.307883,7.260208000000001e-17,3.75324,6.961218e-18,2.653941,0.03663,2e-06,0.025903,...,-0.756845,-1.213263e-06,-0.535171,-4.748696,-0.000395,-3.358114,-0.000171,-2.374666,-6.3e-05,-1.679187
3,7.50648,2.618263e-16,5.307883,7.260208000000001e-17,3.75324,6.961218e-18,2.653941,0.036631,2e-06,0.025904,...,-0.756846,-3.033157e-07,-0.535171,-4.748894,-0.000432,-3.358281,-0.000173,-2.374785,-5.6e-05,-1.679267
4,7.50648,2.618263e-16,5.307883,7.260208000000001e-17,3.75324,6.961218e-18,2.653941,0.036633,2e-06,0.025905,...,-0.756847,2.4052860000000003e-17,-0.535171,-4.749107,-0.000458,-3.358449,-0.000164,-2.374898,-5.6e-05,-1.679346


## Statistical Feature Extraction

Summary statistics are one of the most common approaches to feature extraction for time-series data. The `ExtractStatisticalFeatures` class takes the pre-windowed data and calculates a set of standard statistical descriptors for each window. These features summarize the shape and distribution of the data within that specific time segment.

The features extracted are:
* **`mean`, `std`**: Describe the central tendency and dispersion (volatility).
* **`skew`, `kurt`**: Describe the shape of the distribution (asymmetry and tail weight, i.e. the presence of outliers).
* **`min`, `1qrt`, `med`, `3qrt`, `max`**: Provide a five-number summary of the distribution (minimum, first quartile, median, third quartile, maximum).

Since our data is already windowed from the previous step, we can reuse the `windowed_data` `DataFrame` directly.
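
As a rough illustration of what these nine descriptors measure, the sketch below computes them for a single window with NumPy. The function `window_statistics` is hypothetical, and the estimator choices (population moments, excess kurtosis, linear-interpolation quartiles) are assumptions; the toolkit may use different conventions, so the values need not match its output exactly.

```python
import numpy as np

def window_statistics(window):
    """Nine summary statistics of one window (illustrative sketch only)."""
    x = np.asarray(window, dtype=float)
    mean, std = x.mean(), x.std()  # population standard deviation (ddof=0)
    # Standardise the samples; a constant window has no shape to describe.
    z = (x - mean) / std if std > 0 else np.zeros_like(x)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return {
        "mean": mean,
        "std": std,
        "skew": (z ** 3).mean(),                            # asymmetry
        "kurt": (z ** 4).mean() - 3.0 if std > 0 else 0.0,  # excess kurtosis
        "min": x.min(),
        "1qrt": q1,
        "med": med,
        "3qrt": q3,
        "max": x.max(),
    }

stats = window_statistics([1, 2, 3, 4, 5, 6, 7, 8])
for name, value in stats.items():
    print(f"{name:>5}: {value:.4f}")
```

A symmetric window such as this one has zero skew, and its flat distribution yields a negative excess kurtosis. A constant window returns zero for `std`, `skew`, and `kurt`, as the `var1` columns do in the toolkit output.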

In [8]:
# Configure the statistical feature extractor
statistical_config = StatisticalConfig(
    window_size=WINDOW_SIZE,
    overlap=OVERLAP,
    label_column="label"
)

# Instantiate the extractor
statistical_feature_extractor = extract_statistical_features.ExtractStatisticalFeatures(statistical_config)
statistical_feature_extractor.is_windowed = True

# Run the feature extraction step by calling the instance
statistical_features = statistical_feature_extractor(windowed_data)

# Display the results
print("Shape of the final statistical features:", statistical_features.shape)
print("\nColumns are named 'var<index>_<feature>', and the label column is preserved.")
statistical_features.head()

Shape of the final statistical features: (21520, 198)

Columns are named 'var<index>_<feature>', and the label column is preserved.


,var1_mean,var1_std,var1_skew,var1_kurt,var1_min,var1_1qrt,var1_med,var1_3qrt,var1_max,var2_mean,...,var21_max,var22_mean,var22_std,var22_skew,var22_kurt,var22_min,var22_1qrt,var22_med,var22_3qrt,var22_max
0,2.653941,0.0,0.0,0.0,2.653941,2.653941,2.653941,2.653941,2.653941,0.012942,...,-0.53513,-1.679303,8.4e-05,0.395814,-1.387369,-1.679398,-1.679379,-1.679326,-1.679234,-1.679165
1,2.653941,0.0,0.0,0.0,2.653941,2.653941,2.653941,2.653941,2.653941,0.012942,...,-0.535132,-1.67927,9.5e-05,0.105418,-1.4345,-1.679398,-1.679359,-1.679271,-1.679194,-1.679125
2,2.653941,0.0,0.0,0.0,2.653941,2.653941,2.653941,2.653941,2.653941,0.012943,...,-0.535132,-1.679231,9.9e-05,-0.066725,-1.328554,-1.679375,-1.679312,-1.679224,-1.679155,-1.679086
3,2.653941,0.0,0.0,0.0,2.653941,2.653941,2.653941,2.653941,2.653941,0.012943,...,-0.535133,-1.679175,0.000124,0.526921,-0.369943,-1.679353,-1.679257,-1.679184,-1.679115,-1.678928
4,2.653941,0.0,0.0,0.0,2.653941,2.653941,2.653941,2.653941,2.653941,0.012943,...,-0.535135,-1.679103,0.000163,0.865929,-0.365397,-1.679298,-1.679214,-1.679145,-1.679046,-1.678771


## Exponentially Weighted Statistical Feature Extraction

The `ExtractEWStatisticalFeatures` class provides a specialized version of the standard statistical features. The "EW" stands for **Exponentially Weighted**.

In this method, not all data points in a window are treated equally. Instead, more recent data points are given progressively higher weight than older points. The rate at which the importance of older data "decays" is controlled by the `decay` parameter.

This is particularly useful in scenarios where the most recent behavior within a window is more predictive of the outcome than the behavior at the beginning of the window. It creates features that are more sensitive to the latest changes in the signal.

Again, we will use the same `windowed_data` `DataFrame` as input.
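
The weighting idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: `ew_mean_std` is a hypothetical helper, and the weighting scheme (weight `decay**age`, normalised to sum to 1) is one common choice that may not match how `ExtractEWStatisticalFeatures` applies its `decay` parameter.

```python
import numpy as np

def ew_mean_std(window, decay):
    """Exponentially weighted mean and standard deviation (illustrative sketch)."""
    x = np.asarray(window, dtype=float)
    age = np.arange(len(x))[::-1]  # 0 for the newest sample, len(x) - 1 for the oldest
    weights = decay ** age         # older samples receive smaller weights
    weights /= weights.sum()       # normalise so the weights sum to 1
    mean = np.sum(weights * x)
    std = np.sqrt(np.sum(weights * (x - mean) ** 2))
    return mean, std, weights

ramp = np.arange(8.0)  # steadily increasing window: 0, 1, ..., 7
mean, std, weights = ew_mean_std(ramp, decay=0.9)
print("weights:", np.round(weights, 3))
print("plain mean:", ramp.mean(), "EW mean:", round(mean, 3))
```

On a rising window the weighted mean sits above the plain mean because the latest, larger samples count for more. A `decay` closer to 1 flattens the weights toward an ordinary average, while a smaller value concentrates the weight on the final samples.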

In [9]:
# Configure the EW statistical feature extractor
# The decay parameter is specific to this class and controls the weighting.
ew_statistical_config = EWStatisticalConfig(
    window_size=WINDOW_SIZE, # Using the same WINDOW_SIZE from the previous step
    overlap=OVERLAP,       # Using the same OVERLAP
    decay=0.9,             # A decay factor of 0.9 gives more weight to recent points
    label_column="label"
)

# Instantiate the extractor
ew_feature_extractor = extract_exponential_statistics_features.ExtractEWStatisticalFeatures(ew_statistical_config)
ew_feature_extractor.is_windowed = True

# Run the feature extraction step by calling the instance
ew_statistical_features = ew_feature_extractor(windowed_data)

# Display the results
print("Shape of the final EW statistical features:", ew_statistical_features.shape)
print("\nColumns are named 'var<index>_ew_<feature>', and the label column is preserved.")
ew_statistical_features.head()

Shape of the final EW statistical features: (21520, 198)

Columns are named 'var<index>_ew_<feature>', and the label column is preserved.


,var1_ew_mean,var1_ew_std,var1_ew_skew,var1_ew_kurt,var1_ew_min,var1_ew_1qrt,var1_ew_med,var1_ew_3qrt,var1_ew_max,var2_ew_mean,...,var21_ew_max,var22_ew_mean,var22_ew_std,var22_ew_skew,var22_ew_kurt,var22_ew_min,var22_ew_1qrt,var22_ew_med,var22_ew_3qrt,var22_ew_max
0,2.653941,5.177617e-07,0.039699,0.013543,0.341135,0.341135,0.341135,0.341135,0.341135,0.012942,...,1.374752,-1.679283,8.5e-05,0.02558,1.390307,-1.326744,-1.105944,-0.494716,0.569483,1.368343
1,2.653941,5.177617e-07,0.039699,0.013543,0.341135,0.341135,0.341135,0.341135,0.341135,0.012943,...,1.144517,-1.679247,9.4e-05,-0.231377,1.573851,-1.580895,-1.171037,-0.249718,0.557943,1.284666
2,2.653941,5.177617e-07,0.039699,0.013543,0.341135,0.341135,0.341135,0.341135,0.341135,0.012943,...,1.322047,-1.679208,9.6e-05,-0.380822,1.803683,-1.723689,-1.073552,-0.167735,0.541276,1.250286
3,2.653941,5.177617e-07,0.039699,0.013543,0.341135,0.341135,0.341135,0.341135,0.341135,0.012943,...,1.261031,-1.679146,0.000127,0.305478,2.273038,-1.618491,-0.871122,-0.301486,0.238103,1.698316
4,2.653941,5.177617e-07,0.039699,0.013543,0.341135,0.341135,0.341135,0.341135,0.341135,0.012944,...,1.02498,-1.679064,0.000171,0.543951,1.990343,-1.357954,-0.86835,-0.468024,0.10326,1.70219
