# 5. Feature Extraction

**Quick overview of the feature extraction**

This notebook shows how to use the feature extraction tools of the **3WToolkit v2.0.0**.

The examples demonstrate how the toolkit transforms raw time-series data into meaningful features for machine learning applications.

The feature extraction tools in this package operate at the event level.

## 📋 Table of Contents

1. [Data preparation](#Data-preparation)
   1. [Importing required libraries](#Importing-required-libraries)
   2. [Loading and preparing the Dataset](#Loading-and-preparing-the-Dataset)
2. [Wavelet Feature Extraction](#Wavelet-Feature-Extraction)
3. [Statistical Feature Extraction](#Statistical-Feature-Extraction)
4. [Exponentially Weighted Statistical Feature Extraction](#Exponentially-Weighted-Statistical-Feature-Extraction)
5. [Next Steps](#Next-Steps)
---

## Data preparation

### Importing required libraries

Let's start by importing all the necessary libraries and modules for feature extraction:


In [1]:
import pandas as pd
from pathlib import Path

# Core toolkit imports
from ThreeWToolkit.dataset import ParquetDataset
from ThreeWToolkit.core.base_dataset import ParquetDatasetConfig, EventPrefixEnum

# Preprocessing imports (needed for windowing)
from ThreeWToolkit.preprocessing import Windowing
from ThreeWToolkit.core.base_preprocessing import WindowingConfig

# Feature extraction imports
from ThreeWToolkit.feature_extraction import (
    ExtractStatisticalFeatures,
    ExtractWaveletFeatures, 
    ExtractEWStatisticalFeatures
)
from ThreeWToolkit.core.base_feature_extractor import (
    StatisticalConfig,
    WaveletConfig,
    EWStatisticalConfig
)

print("✅ All libraries imported successfully!")

✅ All libraries imported successfully!


### Loading and preparing the Dataset

Let's load the 3W dataset and prepare it for feature extraction. We'll use a subset of the data to make the examples more manageable:


In [2]:
# Define dataset path (adjust this path according to your setup)
dataset_path = Path("../../dataset")

# Configure dataset loading (only real events)
ds_config = ParquetDatasetConfig(
    path=dataset_path,
    clean_data=True,
    event_type=[EventPrefixEnum.REAL]  
)

# Load the dataset
ds = ParquetDataset(ds_config)
print("📁 Dataset loaded successfully!")
print(f"📊 Total number of events: {len(ds)}")

[ParquetDataset] Dataset found at ..\..\dataset
[ParquetDataset] Validating dataset integrity...
[ParquetDataset] Dataset integrity check passed!
📁 Dataset loaded successfully!
📊 Total number of events: 1119


Let's select one event, extract its signal and labels, and keep only two of its variables:

In [3]:
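id_event = 480

# Load the event once, then split it into signal and label
event = ds[id_event]
x_raw, y_raw = event["signal"], event["label"]
# Keep only two variables so the examples stay small
x_raw = x_raw[["T-JUS-CKP", "P-MON-CKP"]]
x_raw.head()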

| timestamp | T-JUS-CKP | P-MON-CKP |
|---|---|---|
| 2017-08-17 02:00:00 | -0.422636 | 2.812677 |
| 2017-08-17 02:00:01 | -0.422642 | 2.812677 |
| 2017-08-17 02:00:02 | -0.422648 | 2.812674 |
| 2017-08-17 02:00:03 | -0.422653 | 2.812674 |
| 2017-08-17 02:00:04 | -0.422659 | 2.812674 |


-----------

## Wavelet Feature Extraction


The `ExtractWaveletFeatures` class applies the **Stationary Wavelet Transform (SWT)**, a signal processing method that decomposes each window of the signal into multiple frequency components. This approach can reveal patterns that are often hidden from conventional statistical methods.

At each decomposition level, SWT produces two types of coefficients:

- **Approximation Coefficients (A)**: Represent the low-frequency content of the signal, capturing its general trend or smooth behavior within the window. You can think of these as a "smoothed" version of the original signal.

- **Detail Coefficients (D)**: Represent the high-frequency content, including noise, spikes, and abrupt changes. These features help the model distinguish between the overall signal shape and its fine-grained, high-frequency variations.

In [4]:
# Define parameters that will be shared between windowing and feature extraction
# A wavelet decomposition of depth LEVEL needs windows of 2**LEVEL samples
LEVEL = 3
WINDOW_SIZE = 2**LEVEL
OVERLAP = 0.875  # Equivalent to a stride of 1 for a window of 8: 8 * (1 - 0.875) = 1

print("🔧 Windowing Configuration:")
print(f"   Window size: {WINDOW_SIZE}")
print(f"   Overlap: {OVERLAP} ({OVERLAP*100}%)")
print(f"   Stride: {int(WINDOW_SIZE * (1 - OVERLAP))}")

# Configure and instantiate the Windowing step
# We use a 'boxcar' window because we don't want to change the signal
# before applying the wavelet transform.
windowing_config = WindowingConfig(
    window_size=WINDOW_SIZE,
    overlap=OVERLAP,
    window="boxcar" 
)
windowing_step = Windowing(windowing_config)

# Run the windowing step on the raw data
windowed_x = windowing_step(x_raw)

# Align the labels (y)
# The Windowing class doesn't handle labels, so we must align them manually.
# The label for each window corresponds to the label at the end of that window in the original series.
# Guard against a stride of 0 when the overlap is very close to 1
step = max(1, int(WINDOW_SIZE * (1 - OVERLAP)))
end_of_window_indices = [
    i + WINDOW_SIZE - 1 for i in range(0, len(x_raw) - WINDOW_SIZE + 1, step)
]
aligned_y = y_raw.iloc[end_of_window_indices]

# Combine the windowed features and aligned labels into a single DataFrame
# We reset both indexes so the rows are matched by position during concatenation.
windowed_data = pd.concat([
    windowed_x.reset_index(drop=True),
    aligned_y.reset_index(drop=True)
], axis=1)

print("\n✅ Windowing completed!")
print(f"   Original data shape: {x_raw.shape}")
print(f"   Windowed data shape: {windowed_data.shape}")
print(f"   Number of windows created: {len(windowed_data)}")

# Display the first few rows of windowed data
print("\n📋 First few rows of windowed data:")
windowed_data.head()


🔧 Windowing Configuration:
   Window size: 8
   Overlap: 0.875 (87.5%)
   Stride: 1

✅ Windowing completed!
   Original data shape: (21549, 2)
   Windowed data shape: (21542, 18)
   Number of windows created: 21542

📋 First few rows of windowed data:


Unnamed: 0,var1_t0,var1_t1,var1_t2,var1_t3,var1_t4,var1_t5,var1_t6,var1_t7,var2_t0,var2_t1,var2_t2,var2_t3,var2_t4,var2_t5,var2_t6,var2_t7,win,class
0,-0.422636,-0.422642,-0.422648,-0.422653,-0.422659,-0.422664,-0.42267,-0.422676,2.812677,2.812677,2.812674,2.812674,2.812674,2.812674,2.812674,2.812671,1,0
1,-0.422642,-0.422648,-0.422653,-0.422659,-0.422664,-0.42267,-0.422676,-0.422682,2.812677,2.812674,2.812674,2.812674,2.812674,2.812674,2.812671,2.812671,2,0
2,-0.422648,-0.422653,-0.422659,-0.422664,-0.42267,-0.422676,-0.422682,-0.422687,2.812674,2.812674,2.812674,2.812674,2.812674,2.812671,2.812671,2.812671,3,0
3,-0.422653,-0.422659,-0.422664,-0.42267,-0.422676,-0.422682,-0.422687,-0.422693,2.812674,2.812674,2.812674,2.812674,2.812671,2.812671,2.812671,2.812671,4,0
4,-0.422659,-0.422664,-0.42267,-0.422676,-0.422682,-0.422687,-0.422693,-0.422698,2.812674,2.812674,2.812674,2.812671,2.812671,2.812671,2.812671,2.812671,5,0
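
The window count and the label positions reported above follow directly from the window size and the overlap. The sketch below reproduces that arithmetic in plain Python. `window_layout` is a hypothetical helper written for this illustration (it is not part of the toolkit), and it assumes the stride is derived from the overlap in the same way as in the cell above.

```python
def window_layout(n_samples, window_size, overlap):
    """Return (stride, number of windows, index of the last sample of each window)."""
    # Same stride rule as in the windowing cell, guarded so it never drops to 0
    stride = max(1, int(window_size * (1 - overlap)))
    # A window starts every `stride` samples as long as it fits entirely in the series
    starts = range(0, n_samples - window_size + 1, stride)
    end_indices = [start + window_size - 1 for start in starts]
    return stride, len(end_indices), end_indices


# Values taken from the output above: 21549 samples, window of 8, overlap of 0.875
stride, n_windows, end_indices = window_layout(21549, 8, 0.875)
print(stride, n_windows, end_indices[0], end_indices[-1])  # prints: 1 21542 7 21548
```

With these settings each window ends one sample after the previous one, so the label of window `k` is the label of sample `k + 7` in the original series.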


In [5]:
# Configure the feature extractor
wavelet_config = WaveletConfig(
    level=LEVEL, 
    wavelet="haar",
)

# Instantiate the extractor
feature_extractor = ExtractWaveletFeatures(wavelet_config)
feature_extractor.is_windowed = True

# Run the feature extraction step by calling the instance
wavelet_features = feature_extractor(windowed_data)

# Display the results
print("Shape of the final extracted features:", wavelet_features.shape)
print("\nColumns are named 'var<index>_<feature>'.")
wavelet_features.head()


Shape of the final extracted features: (21542, 14)

Columns are named 'var<index>_<feature>'.


Unnamed: 0,var1_A3,var1_D3,var1_A2,var1_D2,var1_A1,var1_D1,var1_A0,var2_A3,var2_D3,var2_A2,var2_D2,var2_A1,var2_D1,var2_A0
0,-1.195452,-3.2e-05,-0.845335,-1.1e-05,-0.59775,-4e-06,-0.422676,7.955445,-3e-06,5.625347,-1.574667e-06,3.97772,-2.226915e-06,2.812671
1,-1.195468,-3.2e-05,-0.845346,-1.2e-05,-0.597758,-4e-06,-0.422682,7.955443,-3e-06,5.625345,-3.149334e-06,3.977718,-9.912177000000001e-17,2.812671
2,-1.195484,-3.2e-05,-0.845357,-1.1e-05,-0.597766,-4e-06,-0.422687,7.955441,-3e-06,5.625344,-1.574667e-06,3.977718,-9.912177000000001e-17,2.812671
3,-1.1955,-3.2e-05,-0.845369,-1.1e-05,-0.597774,-4e-06,-0.422693,7.955439,-4e-06,5.625342,9.022461000000001e-17,3.977718,-9.912177000000001e-17,2.812671
4,-1.195516,-3.2e-05,-0.84538,-1.1e-05,-0.597782,-4e-06,-0.422698,7.955438,-3e-06,5.625342,9.022461000000001e-17,3.977718,-9.912177000000001e-17,2.812671
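
For the Haar wavelet, the first row of the table above can be checked by hand: each approximation value is a scaled sum over the most recent `2**j` samples, and each detail value is a scaled difference between the recent and the older half of that span. The sketch below is a hand-rolled illustration under that assumption. `trailing_haar_features` is a hypothetical helper, not the toolkit's implementation, and sign or scaling conventions may differ in other wavelet libraries.

```python
import math


def trailing_haar_features(window, level):
    """Haar approximation (A) and detail (D) values at the last sample of a window.

    Assumes len(window) == 2**level. Illustration only, not the toolkit's code.
    """
    if len(window) != 2 ** level:
        raise ValueError("window must contain exactly 2**level samples")
    features = {}
    for j in range(level, 0, -1):
        span = 2 ** j                     # number of samples covered at level j
        half = span // 2
        older = sum(window[-span:-half])  # older half of the span
        recent = sum(window[-half:])      # most recent half of the span
        scale = math.sqrt(span)           # orthonormal Haar scaling, 2**(j/2)
        features[f"A{j}"] = (older + recent) / scale
        features[f"D{j}"] = (recent - older) / scale
    features["A0"] = window[-1]           # level 0 is the raw last sample
    return features


# First window of var1 (T-JUS-CKP), copied from the windowed table shown earlier
window = [-0.422636, -0.422642, -0.422648, -0.422653,
          -0.422659, -0.422664, -0.422670, -0.422676]
for name, value in trailing_haar_features(window, level=3).items():
    print(f"{name}: {value:.6f}")
```

Because the inputs are copied from a display rounded to six decimals, the results agree with the `var1_*` columns of the first row only up to roughly `1e-5`.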


In [6]:
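# Configure with an offset.
# Comparing the output below with the previous table suggests that the
# extracted features start `offset` samples later in the series.
offset = 20
config_offset = WaveletConfig(level=LEVEL, overlap=OVERLAP, offset=offset)
extractor_offset = ExtractWaveletFeatures(config_offset)
extractor_offset.is_windowed = True
features_offset = extractor_offset(windowed_data)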

features_offset.head()


Unnamed: 0,var1_A3,var1_D3,var1_A2,var1_D2,var1_A1,var1_D1,var1_A0,var2_A3,var2_D3,var2_A2,var2_D2,var2_A1,var2_D1,var2_A0
0,-1.195771,-3.2e-05,-0.84556,-1.2e-05,-0.59791,-4e-06,-0.422789,7.955409,-3e-06,5.625322,-1.574667e-06,3.977702,-2.226915e-06,2.812658
1,-1.195787,-3.2e-05,-0.845572,-1.1e-05,-0.597917,-4e-06,-0.422794,7.955407,-3e-06,5.62532,-3.149334e-06,3.9777,9.965623000000001e-17,2.812658
2,-1.195803,-3.2e-05,-0.845583,-1.1e-05,-0.597925,-4e-06,-0.4228,7.955405,-3e-06,5.625318,-1.574667e-06,3.9777,9.965623000000001e-17,2.812658
3,-1.195819,-3.2e-05,-0.845594,-1.1e-05,-0.597933,-4e-06,-0.422806,7.955404,-4e-06,5.625317,9.022321e-17,3.9777,9.965623000000001e-17,2.812658
4,-1.195835,-3.2e-05,-0.845605,-1.1e-05,-0.597941,-4e-06,-0.422811,7.955403,-3e-06,5.625317,9.022321e-17,3.9777,9.965623000000001e-17,2.812658


-------

## Statistical Feature Extraction

Summary statistics are one of the most common approaches to feature extraction for time-series data. The `ExtractStatisticalFeatures` class takes the pre-windowed data and calculates a set of standard statistical descriptors for each window. These features summarize the shape and distribution of the data within that specific time segment.

The features extracted are:
* **`mean`, `std`**: Describe the central tendency and dispersion (volatility).
* **`skew`, `kurt`**: Describe the shape of the distribution (asymmetry and tail heaviness, which reflects the presence of outliers).
* **`min`, `1qrt`, `med`, `3qrt`, `max`**: Provide a five-number summary of the distribution (extremes, quartiles, and median).

Since our data is already windowed from the previous step, we can reuse the `windowed_data` `DataFrame` directly.


In [7]:
# Configure the statistical feature extractor
statistical_config = StatisticalConfig(
    window_size=WINDOW_SIZE,
    overlap=OVERLAP,
)

# Instantiate the extractor
statistical_feature_extractor = ExtractStatisticalFeatures(statistical_config)
statistical_feature_extractor.is_windowed = True

# Run the feature extraction step by calling the instance
statistical_features = statistical_feature_extractor(windowed_data)

# Display the results
print("Shape of the final statistical features:", statistical_features.shape)
print("\nColumns are named 'var<index>_<feature>'.")
statistical_features.head()


Shape of the final statistical features: (21542, 18)

Columns are named 'var<index>_<feature>'.


Unnamed: 0,var1_mean,var1_std,var1_skew,var1_kurt,var1_min,var1_1qrt,var1_med,var1_3qrt,var1_max,var2_mean,var2_std,var2_skew,var2_kurt,var2_min,var2_1qrt,var2_med,var2_3qrt,var2_max
0,-0.422656,1.3e-05,0.014377,-1.225587,-0.422676,-0.422666,-0.422656,-0.422646,-0.422636,2.812675,2e-06,-0.05439506,-0.3138,2.812671,2.812674,2.812674,2.812675,2.812677
1,-0.422662,1.3e-05,0.0,-1.218076,-0.422682,-0.422672,-0.422662,-0.422652,-0.422642,2.812674,2e-06,0.05439506,-0.3138,2.812671,2.812673,2.812674,2.812674,2.812677
2,-0.422667,1.3e-05,0.005683,-1.242868,-0.422687,-0.422677,-0.422667,-0.422658,-0.422648,2.812673,2e-06,-0.5163978,-1.733333,2.812671,2.812671,2.812674,2.812674,2.812674
3,-0.422673,1.3e-05,0.0,-1.248227,-0.422693,-0.422683,-0.422673,-0.422663,-0.422653,2.812673,2e-06,-4.230316e-10,-2.0,2.812671,2.812671,2.812673,2.812674,2.812674
4,-0.422679,1.3e-05,-0.005683,-1.242868,-0.422698,-0.422688,-0.422679,-0.422669,-0.422659,2.812672,2e-06,0.5163978,-1.733333,2.812671,2.812671,2.812671,2.812674,2.812674
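
To make the descriptors concrete, the sketch below recomputes them for the first window of `var1` using only the Python standard library. The conventions are assumptions inferred from the table above rather than confirmed toolkit behavior: population moments (division by `n`), excess kurtosis (normal distribution = 0), and quartiles with linear interpolation. `window_statistics` is a hypothetical helper name.

```python
import statistics


def window_statistics(values):
    """Descriptive statistics for one window (population moments, excess kurtosis)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments of order 2, 3 and 4, each divided by n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    # Quartiles with linear interpolation between neighbouring data points
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": mean,
        "std": m2 ** 0.5,
        "skew": m3 / m2 ** 1.5,
        "kurt": m4 / m2 ** 2 - 3,
        "min": min(values),
        "1qrt": q1,
        "med": med,
        "3qrt": q3,
        "max": max(values),
    }


# First window of var1 (T-JUS-CKP), copied from the windowed table shown earlier
window = [-0.422636, -0.422642, -0.422648, -0.422653,
          -0.422659, -0.422664, -0.422670, -0.422676]
for name, value in window_statistics(window).items():
    print(f"{name}: {value:.6g}")
```

The location and spread values match the first row of the table to display precision. `skew` and `kurt` are more sensitive to the six-decimal rounding of the copied inputs, so they agree only approximately. A constant window has zero variance, so this sketch would need a guard before dividing by `m2`.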


---------

## Exponentially Weighted Statistical Feature Extraction


The `ExtractEWStatisticalFeatures` class provides a specialized version of the standard statistical features. The "EW" stands for **Exponentially Weighted**.

In this method, not all data points in a window are treated equally. Instead, more recent data points are given progressively higher weight than older points. The rate at which the importance of older data "decays" is controlled by the `decay` parameter.

This is particularly useful in scenarios where the most recent behavior within a window is more predictive of the outcome than the behavior at the beginning of the window. It creates features that are more sensitive to the latest changes in the signal.

Again, we will use the same `windowed_data` `DataFrame` as input.


In [8]:
# Configure the EW statistical feature extractor
# The decay parameter is specific to this class and controls the weighting
ew_statistical_config = EWStatisticalConfig(
    window_size=WINDOW_SIZE, 
    overlap=OVERLAP,       
    decay=0.9,             
)

# Instantiate the extractor
ew_feature_extractor = ExtractEWStatisticalFeatures(ew_statistical_config)
ew_feature_extractor.is_windowed = True

# Run the feature extraction step by calling the instance
ew_statistical_features = ew_feature_extractor(windowed_data)

# Display the results
print("Shape of the final EW statistical features:", ew_statistical_features.shape)
print("\nColumns are named 'var<index>_ew_<feature>'.")
ew_statistical_features.head()


Shape of the final EW statistical features: (21542, 18)

Columns are named 'var<index>_ew_<feature>'.


Unnamed: 0,var1_ew_mean,var1_ew_std,var1_ew_skew,var1_ew_kurt,var1_ew_min,var1_ew_1qrt,var1_ew_med,var1_ew_3qrt,var1_ew_max,var2_ew_mean,var2_ew_std,var2_ew_skew,var2_ew_kurt,var2_ew_min,var2_ew_1qrt,var2_ew_med,var2_ew_3qrt,var2_ew_max
0,-0.422659,1.3e-05,0.232132,1.404927,-1.224299,-0.490562,0.205059,0.938796,1.672533,2.812674,2e-06,0.230828,0.543234,-0.883502,0.184209,0.184209,0.451136,1.251919
1,-0.422665,1.3e-05,0.219483,1.399988,-1.229061,-0.495394,0.219217,0.933829,1.667496,2.812673,2e-06,0.269734,0.516521,-0.613878,0.185252,0.451629,0.451629,1.517135
2,-0.42267,1.3e-05,0.227676,1.380152,-1.210163,-0.509903,0.228728,0.928988,1.667619,2.812672,2e-06,0.203801,0.196232,-0.413408,-0.413408,0.767977,0.767977,0.767977
3,-0.422676,1.3e-05,0.223323,1.376556,-1.222227,-0.511198,0.219049,0.949295,1.660325,2.812672,2e-06,0.307273,0.299193,-0.265239,-0.265239,0.332321,0.92988,0.92988
4,-0.422682,1.3e-05,0.21777,1.381225,-1.228161,-0.49018,0.209464,0.947445,1.64709,2.812671,2e-06,0.390031,0.437863,-0.133488,-0.133488,-0.133488,1.117152,1.117152
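
The effect of `decay` can be illustrated with a small standalone computation. The sketch below assumes that a sample `k` steps before the end of the window receives weight `decay**k`, so the last sample has weight 1. This weighting scheme is an assumption chosen because it is consistent with `var1_ew_mean` and `var1_ew_std` in the first row above, and `ew_mean_std` is a hypothetical helper, not the toolkit's code.

```python
def ew_mean_std(values, decay):
    """Exponentially weighted mean and standard deviation of one window."""
    n = len(values)
    # The value k steps before the end of the window gets weight decay**k
    weights = [decay ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values)) / total
    return mean, var ** 0.5


# First window of var1 (T-JUS-CKP), copied from the windowed table shown earlier
window = [-0.422636, -0.422642, -0.422648, -0.422653,
          -0.422659, -0.422664, -0.422670, -0.422676]
ew_mean, ew_std = ew_mean_std(window, decay=0.9)
plain_mean = sum(window) / len(window)
print(f"plain mean: {plain_mean:.6f}")
print(f"EW mean:    {ew_mean:.6f}")
print(f"EW std:     {ew_std:.2e}")
```

The signal decreases over this window, so weighting recent samples more heavily pulls the mean below the unweighted mean (`var1_mean` in the statistical features table).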


## Next Steps

🎉 **Nice!** Now you can use the **3W Toolkit** feature extraction tools!

### What's Next?

1. **Data Visualization**: Discover how to visualize processed data in [Notebook 6: Data Visualization](6_data_visualization.ipynb)
2. **Model Training and Evaluation**: Discover how to train and evaluate machine learning models in [Notebook 7: Model Training and Evaluation](7_model_training_and_evaluation.ipynb)
---


**📚 Tutorial Navigation:**
- **Previous**: [4. Preprocessing](4_preprocessing.ipynb)
- **Next**: [6. Data Visualization](6_data_visualization.ipynb)

**🔗 Additional Resources:**
- [3W Project Repository](https://github.com/petrobras/3W)
- [3W Dataset on Figshare](https://figshare.com/projects/3W_Dataset/251195)
