# 4. Preprocessing

**Quick overview of the preprocessing features**

This notebook shows how to use the preprocessing tools of the **3W Toolkit v2.0.0**.

These tools complement the canonical data cleanup. They can be used instead of it, or in combination with it, if you prefer to tackle data-quality problems with a different approach.

The preprocessing tools in this package operate at the event level.

## 📋 Table of Contents

1. [Data preparation](#Data-preparation)
   1. [Imputation](#Handling-missing-data)
   2. [Normalization](#Normalization)
   3. [Windowing](#Windowing)
   4. [Renaming](#Column-renaming)
2. [Next Steps](#Next-Steps)
---


## Data preparation

### Handling missing data

Handling missing data is one of the most common tasks when working with datasets such as the 3W Dataset. The **3W Toolkit v2.0.0** ships with several strategies for missing-data imputation.

Let's check them out.

We begin by loading the first dataset entry and selecting three of its signal columns:

In [1]:
from ThreeWToolkit.dataset import ParquetDataset, ParquetDatasetConfig

config = ParquetDatasetConfig(path="../../dataset", clean_data=True)
dataset = ParquetDataset(config)

sample = dataset[0]["signal"][["P-PDG", "T-TPT", "P-TPT"]].copy()
sample.head(10)

[ParquetDataset] Dataset found at ../../dataset
[ParquetDataset] Validating dataset integrity...
[ParquetDataset] Dataset integrity check passed!


timestamp,P-PDG,T-TPT,P-TPT
2018-04-26 00:20:00,0.717287,0.85225,1.002349
2018-04-26 00:20:01,0.717285,0.85225,1.002358
2018-04-26 00:20:02,0.717287,0.85225,1.002366
2018-04-26 00:20:03,0.717288,0.852253,1.002375
2018-04-26 00:20:04,0.717289,0.852253,1.002383
2018-04-26 00:20:05,0.717291,0.852253,1.002392
2018-04-26 00:20:06,0.717294,0.852253,1.002402
2018-04-26 00:20:07,0.717298,0.852256,1.002407
2018-04-26 00:20:08,0.7173,0.852253,1.002391
2018-04-26 00:20:09,0.717304,0.852253,1.002383


Let's simulate missing data by blanking out every third row of the sample:

In [2]:
import pandas as pd

sample_with_missing = sample.copy()  # work on a copy so the original sample stays intact
sample_with_missing.iloc[::3] = pd.NA  # set every third row to missing

sample_with_missing.head(10)

timestamp,P-PDG,T-TPT,P-TPT
2018-04-26 00:20:00,,,
2018-04-26 00:20:01,0.717285,0.85225,1.002358
2018-04-26 00:20:02,0.717287,0.85225,1.002366
2018-04-26 00:20:03,,,
2018-04-26 00:20:04,0.717289,0.852253,1.002383
2018-04-26 00:20:05,0.717291,0.852253,1.002392
2018-04-26 00:20:06,,,
2018-04-26 00:20:07,0.717298,0.852256,1.002407
2018-04-26 00:20:08,0.7173,0.852253,1.002391
2018-04-26 00:20:09,,,


We can impute a DataFrame using three different strategies:
- `"mean"` → replaces NaNs with the column mean  
- `"median"` → replaces NaNs with the column median  
- `"constant"` → replaces NaNs with a constant value provided by the user  

Imputation can also be restricted to selected `columns`; if none are provided, all numeric columns are imputed.
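For intuition, the three strategies can be reproduced with plain pandas. This is only a sketch on made-up values, not the toolkit's implementation; the toy frame and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy frame with gaps; column names mirror the sample, the values are made up.
df = pd.DataFrame(
    {"P-PDG": [1.0, None, 3.0, None], "T-TPT": [10.0, 20.0, None, 60.0]}
)

mean_filled = df.fillna(df.mean())      # each NaN becomes its column mean
median_filled = df.fillna(df.median())  # each NaN becomes its column median
constant_filled = df.fillna(0.0)        # each NaN becomes a user-chosen constant

# Restricting imputation to selected columns leaves the others untouched.
cols = ["P-PDG"]
partial = df.copy()
partial[cols] = partial[cols].fillna(partial[cols].mean())

print(mean_filled["T-TPT"].tolist())    # [10.0, 20.0, 30.0, 60.0]
print(median_filled["T-TPT"].tolist())  # [10.0, 20.0, 20.0, 60.0]
```

Note that the mean and the median are computed over the non-missing values of the whole column, which is why every imputed row in a column receives the same value.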

In [3]:
from ThreeWToolkit.core.base_preprocessing import ImputeMissingConfig
from ThreeWToolkit.preprocessing import ImputeMissing

impute_missing = ImputeMissing(ImputeMissingConfig(
    strategy="mean",
    #columns=["P-PDG"], # uncomment if you want to select only a few
))

# run it
df_pre = impute_missing.pre_process(sample_with_missing)
df_imputed = impute_missing.run(df_pre)
imputed_df = impute_missing.post_process(df_imputed)

imputed_df.head(10)

timestamp,P-PDG,T-TPT,P-TPT
2018-04-26 00:20:00,0.420202,0.834188,0.686739
2018-04-26 00:20:01,0.717285,0.85225,1.002358
2018-04-26 00:20:02,0.717287,0.85225,1.002366
2018-04-26 00:20:03,0.420202,0.834188,0.686739
2018-04-26 00:20:04,0.717289,0.852253,1.002383
2018-04-26 00:20:05,0.717291,0.852253,1.002392
2018-04-26 00:20:06,0.420202,0.834188,0.686739
2018-04-26 00:20:07,0.717298,0.852256,1.002407
2018-04-26 00:20:08,0.7173,0.852253,1.002391
2018-04-26 00:20:09,0.420202,0.834188,0.686739
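
The `pre_process`, `run`, `post_process` sequence recurs for every step in this notebook. If the repetition becomes tedious, a small helper can chain the three calls. The helper name `apply_step` and the `DoubleStep` stand-in class are hypothetical; they are not part of the toolkit.

```python
def apply_step(step, data):
    """Chain the three stages that the steps in this notebook expose."""
    prepared = step.pre_process(data)
    result = step.run(prepared)
    return step.post_process(result)


class DoubleStep:
    """Hypothetical stand-in for a toolkit step, used only to exercise the helper."""

    def pre_process(self, data):
        return list(data)

    def run(self, data):
        return [2 * x for x in data]

    def post_process(self, data):
        return data


doubled = apply_step(DoubleStep(), (1, 2, 3))
print(doubled)  # [2, 4, 6]
```

With a real step, the call would read `imputed_df = apply_step(impute_missing, sample_with_missing)`.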


In [4]:
impute_missing = ImputeMissing(
    ImputeMissingConfig(strategy="median")
)

df_pre = impute_missing.pre_process(sample_with_missing)
df_imputed = impute_missing.run(df_pre)
imputed_df = impute_missing.post_process(df_imputed)

imputed_df.head(10)

timestamp,P-PDG,T-TPT,P-TPT
2018-04-26 00:20:00,0.385949,0.832506,0.649148
2018-04-26 00:20:01,0.717285,0.85225,1.002358
2018-04-26 00:20:02,0.717287,0.85225,1.002366
2018-04-26 00:20:03,0.385949,0.832506,0.649148
2018-04-26 00:20:04,0.717289,0.852253,1.002383
2018-04-26 00:20:05,0.717291,0.852253,1.002392
2018-04-26 00:20:06,0.385949,0.832506,0.649148
2018-04-26 00:20:07,0.717298,0.852256,1.002407
2018-04-26 00:20:08,0.7173,0.852253,1.002391
2018-04-26 00:20:09,0.385949,0.832506,0.649148


In [5]:
impute_missing = ImputeMissing(
    ImputeMissingConfig(strategy="constant", fill_value=0.0)
)

df_pre = impute_missing.pre_process(sample_with_missing)
df_imputed = impute_missing.run(df_pre)
imputed_df = impute_missing.post_process(df_imputed)

imputed_df.head(10)

timestamp,P-PDG,T-TPT,P-TPT
2018-04-26 00:20:00,0.0,0.0,0.0
2018-04-26 00:20:01,0.717285,0.85225,1.002358
2018-04-26 00:20:02,0.717287,0.85225,1.002366
2018-04-26 00:20:03,0.0,0.0,0.0
2018-04-26 00:20:04,0.717289,0.852253,1.002383
2018-04-26 00:20:05,0.717291,0.852253,1.002392
2018-04-26 00:20:06,0.0,0.0,0.0
2018-04-26 00:20:07,0.717298,0.852256,1.002407
2018-04-26 00:20:08,0.7173,0.852253,1.002391
2018-04-26 00:20:09,0.0,0.0,0.0


### Normalization

We may normalize columns or rows using three different norms:
  - `"l1"` → normalization using the L1 norm (sum of absolute values = 1)  
  - `"l2"` → normalization using the L2 norm (Euclidean norm = 1)  
  - `"max"` → normalization by dividing by the maximum absolute value

Setting `axis=0` normalizes along columns, and `axis=1` along rows.

The normalization step also accepts a flag, `return_norm_values` (default = `False`), which makes it return the computed norms alongside the normalized data.
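
To see what each norm does numerically, here is a minimal NumPy sketch. The function `normalize` below is a hypothetical illustration, not the toolkit's `Normalize` class, and its handling of all-zero columns is an assumption.

```python
import numpy as np


def normalize(values, norm="l2", axis=0):
    """Scale `values` so each column (axis=0) or row (axis=1) has unit norm."""
    values = np.asarray(values, dtype=float)
    if norm == "l1":
        norms = np.abs(values).sum(axis=axis, keepdims=True)
    elif norm == "l2":
        norms = np.sqrt((values ** 2).sum(axis=axis, keepdims=True))
    elif norm == "max":
        norms = np.abs(values).max(axis=axis, keepdims=True)
    else:
        raise ValueError(f"unknown norm: {norm!r}")
    # Illustrative choice: leave all-zero columns/rows unchanged instead of dividing by zero.
    norms = np.where(norms == 0.0, 1.0, norms)
    return values / norms, norms


data = np.array([[3.0, -2.0], [4.0, 1.0]])
normed_max, max_norms = normalize(data, norm="max", axis=0)
normed_l2, _ = normalize(data, norm="l2", axis=0)
normed_l1_rows, _ = normalize(data, norm="l1", axis=1)

print(normed_max)      # columns divided by 4 and 2
print(normed_l1_rows)  # each row's absolute values sum to 1
```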

In [6]:
from ThreeWToolkit.core.base_preprocessing import NormalizeConfig
from ThreeWToolkit.preprocessing import Normalize


normalize_step = Normalize(NormalizeConfig(norm="max", axis=0, return_norm_values=True))

df_pre = normalize_step.pre_process(sample)
normed_array = normalize_step.run(df_pre)
normed, norms = normalize_step.post_process(normed_array)

normed.tail()

timestamp,P-PDG,T-TPT,P-TPT
2018-04-26 08:28:14,0.538049,0.976726,0.647569
2018-04-26 08:28:15,0.538054,0.976719,0.647534
2018-04-26 08:28:16,0.538056,0.976719,0.647545
2018-04-26 08:28:17,0.538054,0.976726,0.647567
2018-04-26 08:28:18,0.538053,0.976719,0.647538


### Windowing

We can also segment a 1D time series into windows, optionally overlapping, and apply a window function to each of them.

The windowing tools accept:  
- A Pandas **Series** (`X`) representing the input 1D signal  
- A window function (`window`, default = `"hann"`):  
  - A string for standard windows (e.g., `"hann"`, `"hamming"`)  
  - A tuple for parametrized windows (e.g., `("kaiser", beta)`)  
- A window size (`window_size`, default = 4): number of samples per window  
- An overlap ratio (`overlap`, default = 0.0): must be in the range `[0, 1)`  
- A flag (`normalize`, default = False) to normalize the window function to unit area  
- A flag (`fftbins`, default = True) to generate FFT-compatible (periodic) windows instead of symmetric ones  
- A flag (`pad_last_window`, default = False) to pad the last incomplete window with constant values  
- A padding value (`pad_value`, default = 0.0) used when `pad_last_window=True`
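
To make these parameters concrete, here is a minimal NumPy sketch of the idea: slide a fixed-size window over the signal, with a hop derived from the overlap ratio, and multiply each segment by a periodic Hamming taper. The function `sliding_windows` is hypothetical, it drops the last incomplete window, and it is not the toolkit's implementation.

```python
import numpy as np


def sliding_windows(signal, window_size=4, overlap=0.0):
    """Split a 1D signal into complete windows and taper each with a periodic Hamming window."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    signal = np.asarray(signal, dtype=float)
    hop = max(1, int(window_size * (1.0 - overlap)))  # samples between window starts
    n = np.arange(window_size)
    taper = 0.54 - 0.46 * np.cos(2.0 * np.pi * n / window_size)  # periodic Hamming
    starts = range(0, len(signal) - window_size + 1, hop)
    return np.array([signal[s:s + window_size] * taper for s in starts])


flat = np.ones(10)
no_overlap = sliding_windows(flat, window_size=4, overlap=0.0)
half_overlap = sliding_windows(flat, window_size=4, overlap=0.5)
print(no_overlap.shape, half_overlap.shape)  # (2, 4) (4, 4)
```

A periodic Hamming taper starts at 0.08 and its last coefficient equals its second one. The same pattern is visible in the `var1_t1` and `var1_t63` columns of the outputs in this section.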

The output is a **DataFrame** in which each row holds one window of the signal after multiplication by the window function, plus a `win` column with the window number.

Let's try it!

In [7]:
from ThreeWToolkit.core.base_preprocessing import WindowingConfig
from ThreeWToolkit.preprocessing import Windowing


# Simple time series
series = sample["T-TPT"] # selecting a single column on a dataframe returns a Series

# Applying windowing without overlap
windowing_step = Windowing(WindowingConfig(window="hamming", window_size=64, overlap=0.0))


windowed = windowing_step.pre_process(series)
windowed = windowing_step.run(windowed)
windowed = windowing_step.post_process(windowed)

windowed

,var1_t0,var1_t1,var1_t2,var1_t3,var1_t4,var1_t5,var1_t6,var1_t7,var1_t8,var1_t9,...,var1_t55,var1_t56,var1_t57,var1_t58,var1_t59,var1_t60,var1_t61,var1_t62,var1_t63,win
0,0.068180,0.070068,0.075713,0.085061,0.098022,0.114471,0.134250,0.157169,0.183005,0.211511,...,0.211511,0.183004,0.157168,0.134250,0.114471,0.098021,0.085061,0.075713,0.070068,1
1,0.068180,0.070068,0.075713,0.085061,0.098022,0.114471,0.134250,0.157168,0.183005,0.211511,...,0.211511,0.183004,0.157168,0.134250,0.114471,0.098021,0.085061,0.075713,0.070068,2
2,0.068180,0.070068,0.075713,0.085061,0.098022,0.114471,0.134250,0.157168,0.183005,0.211510,...,0.211511,0.183004,0.157168,0.134250,0.114471,0.098022,0.085061,0.075713,0.070068,3
3,0.068180,0.070067,0.075713,0.085061,0.098022,0.114471,0.134250,0.157168,0.183005,0.211511,...,0.211511,0.183004,0.157168,0.134250,0.114471,0.098022,0.085061,0.075713,0.070067,4
4,0.068180,0.070067,0.075713,0.085061,0.098022,0.114471,0.134250,0.157168,0.183004,0.211511,...,0.211510,0.183005,0.157168,0.134250,0.114471,0.098022,0.085061,0.075713,0.070067,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
452,0.066594,0.068437,0.073951,0.083082,0.095742,0.111807,0.131125,0.153513,0.178745,0.206588,...,0.206588,0.178745,0.153509,0.131124,0.111808,0.095742,0.083082,0.073952,0.068438,453
453,0.066594,0.068439,0.073952,0.083083,0.095743,0.111808,0.131127,0.153512,0.178748,0.206590,...,0.206586,0.178744,0.153510,0.131127,0.111809,0.095742,0.083081,0.073951,0.068438,454
454,0.066594,0.068438,0.073953,0.083083,0.095742,0.111809,0.131128,0.153512,0.178749,0.206591,...,0.206588,0.178745,0.153510,0.131124,0.111807,0.095741,0.083081,0.073952,0.068438,455
455,0.066594,0.068438,0.073951,0.083082,0.095741,0.111808,0.131127,0.153510,0.178749,0.206592,...,0.206590,0.178749,0.153512,0.131127,0.111808,0.095742,0.083081,0.073951,0.068437,456


In [8]:
# Applying windowing with 50% overlap
windowing_step = Windowing(WindowingConfig(window="hamming", window_size=64, overlap=0.5))


windowed = windowing_step.pre_process(series)
windowed = windowing_step.run(windowed)
windowed = windowing_step.post_process(windowed)

windowed

,var1_t0,var1_t1,var1_t2,var1_t3,var1_t4,var1_t5,var1_t6,var1_t7,var1_t8,var1_t9,...,var1_t55,var1_t56,var1_t57,var1_t58,var1_t59,var1_t60,var1_t61,var1_t62,var1_t63,win
0,0.068180,0.070068,0.075713,0.085061,0.098022,0.114471,0.134250,0.157169,0.183005,0.211511,...,0.211511,0.183004,0.157168,0.134250,0.114471,0.098021,0.085061,0.075713,0.070068,1
1,0.068180,0.070068,0.075713,0.085061,0.098022,0.114471,0.134249,0.157169,0.183004,0.211511,...,0.211511,0.183004,0.157168,0.134250,0.114471,0.098022,0.085061,0.075713,0.070068,2
2,0.068180,0.070068,0.075713,0.085061,0.098022,0.114471,0.134250,0.157168,0.183005,0.211511,...,0.211511,0.183004,0.157168,0.134250,0.114471,0.098021,0.085061,0.075713,0.070068,3
3,0.068180,0.070068,0.075713,0.085061,0.098022,0.114471,0.134250,0.157167,0.183006,0.211510,...,0.211511,0.183004,0.157168,0.134250,0.114471,0.098022,0.085061,0.075713,0.070068,4
4,0.068180,0.070068,0.075713,0.085061,0.098022,0.114471,0.134250,0.157168,0.183005,0.211510,...,0.211511,0.183004,0.157168,0.134250,0.114471,0.098022,0.085061,0.075713,0.070068,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
909,0.066594,0.068438,0.073951,0.083082,0.095742,0.111808,0.131127,0.153510,0.178748,0.206590,...,0.206588,0.178745,0.153511,0.131125,0.111806,0.095740,0.083081,0.073952,0.068438,910
910,0.066594,0.068438,0.073951,0.083082,0.095741,0.111808,0.131127,0.153510,0.178749,0.206592,...,0.206590,0.178749,0.153512,0.131127,0.111808,0.095742,0.083081,0.073951,0.068437,911
911,0.066593,0.068437,0.073952,0.083081,0.095741,0.111807,0.131125,0.153509,0.178745,0.206588,...,0.206591,0.178747,0.153512,0.131127,0.111808,0.095741,0.083081,0.073951,0.068437,912
912,0.066593,0.068437,0.073951,0.083081,0.095740,0.111807,0.131125,0.153510,0.178749,0.206587,...,0.206591,0.178749,0.153512,0.131128,0.111809,0.095742,0.083082,0.073952,0.068438,913


Note how overlapping windows produce more rows in the output DataFrame (913 with 50% overlap versus 456 without), because windows start at more positions along the series.
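
The effect of the overlap on the row count can be estimated with simple arithmetic. The sketch below assumes that only complete windows are kept and uses a hypothetical series length; the toolkit's own boundary handling may give counts that differ slightly from this formula.

```python
def count_full_windows(n_samples, window_size, overlap):
    """Number of complete windows when trailing samples that do not fill a window are dropped."""
    hop = max(1, int(window_size * (1.0 - overlap)))  # samples between window starts
    if n_samples < window_size:
        return 0
    return (n_samples - window_size) // hop + 1


n_samples = 1000  # hypothetical series length
without_overlap = count_full_windows(n_samples, window_size=64, overlap=0.0)
with_half_overlap = count_full_windows(n_samples, window_size=64, overlap=0.5)
print(without_overlap, with_half_overlap)  # 15 30
```

Halving the hop roughly doubles the number of windows, which matches the trend seen in the two outputs.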

### Column renaming

Last but not least, we might want to rename some columns in our pipeline so that they better reflect their meaning as we transform our data.

To rename columns, all we need is a dictionary (`columns_map`) with old names as keys and new names as values.
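
For comparison, the same kind of mapping works with plain pandas through `DataFrame.rename`. This is a sketch on made-up values, shown only to illustrate how the dictionary is interpreted; it does not claim to be how the toolkit implements the step.

```python
import pandas as pd

# Made-up values; only the column names matter here.
df = pd.DataFrame({"P-PDG": [0.71, 0.72], "T-TPT": [0.85, 0.86], "P-TPT": [1.00, 1.01]})
columns_map = {"P-PDG": "sensor_A", "P-TPT": "sensor_B"}

renamed_df = df.rename(columns=columns_map)  # returns a new frame; df is unchanged
print(list(renamed_df.columns))  # ['sensor_A', 'T-TPT', 'sensor_B']
```

Columns that are not listed in the dictionary, such as `T-TPT`, keep their original names.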

In [9]:
from ThreeWToolkit.core.base_preprocessing import RenameColumnsConfig
from ThreeWToolkit.preprocessing import RenameColumns

renaming = {
    "P-PDG": "sensor_A",
    "P-TPT": "sensor_B"
}
rename_step = RenameColumns(RenameColumnsConfig(columns_map=renaming))

renamed = rename_step.pre_process(sample)
renamed = rename_step.run(renamed)
renamed = rename_step.post_process(renamed)


renamed.head(10)

timestamp,sensor_A,T-TPT,sensor_B
2018-04-26 00:20:00,0.717287,0.85225,1.002349
2018-04-26 00:20:01,0.717285,0.85225,1.002358
2018-04-26 00:20:02,0.717287,0.85225,1.002366
2018-04-26 00:20:03,0.717288,0.852253,1.002375
2018-04-26 00:20:04,0.717289,0.852253,1.002383
2018-04-26 00:20:05,0.717291,0.852253,1.002392
2018-04-26 00:20:06,0.717294,0.852253,1.002402
2018-04-26 00:20:07,0.717298,0.852256,1.002407
2018-04-26 00:20:08,0.7173,0.852253,1.002391
2018-04-26 00:20:09,0.717304,0.852253,1.002383


In [10]:
config = ParquetDatasetConfig(
    path="./dataset",
    clean_data=True, # Let's use the cleaned up version!
)
dataset = ParquetDataset(config)
sample = dataset[0]

[ParquetDataset] Dataset found at ./dataset
[ParquetDataset] Validating dataset integrity...
[ParquetDataset] Dataset integrity check passed!


## Next Steps

🎉 **Nice!** Now you can use the **3W Toolkit** preprocessing tools!

### What's Next?

1. **Feature Extraction**: Discover advanced features in [Notebook 5: Feature Extraction](5_feature_extraction.ipynb)
2. **Visualization**: Discover how to visualize processed data in [Notebook 6: Data Visualization](6_data_visualization.ipynb)

---

**📚 Tutorial Navigation:**
- **Previous**: [3. Dataset Download](3_download_dataset.ipynb)
- **Next**: [5. Feature Extraction](5_feature_extraction.ipynb)

**🔗 Additional Resources:**
- [3W Project Repository](https://github.com/petrobras/3W)
- [3W Dataset on Figshare](https://figshare.com/projects/3W_Dataset/251195)
- [Workshop Registration](https://forms.gle/cmLa2u4VaXd1T7qp8)
