# 4. Preprocessing

**Quick overview of the preprocessing features**

This notebook shows how to use the preprocessing tools of the **3W Toolkit v2.0.0**.

These tools complement the canonical data cleanup. They can replace it or be combined with it when you want to tackle a problem with a different approach.

The preprocessing tools in this package operate at the event level.

## 📋 Table of Contents

1. [Data preparation](#Data-preparation)
   1. [Imputation](#Handling-missing-data)
   2. [Normalization](#Normalization)
   3. [Windowing](#Windowing)
   4. [Renaming](#Column-renaming)
2. [Next Steps](#Next-Steps)
---


## Data preparation

### Handling missing data

Handling missing data is one of the most common tasks when working with datasets such as the 3W Dataset. The **3W Toolkit v2.0.0** ships with several strategies for missing-data imputation.

Let's check them out.

We begin by loading one event from the dataset and selecting three of its sensor columns:

In [1]:
from ThreeWToolkit.dataset import ParquetDataset, ParquetDatasetConfig

config = ParquetDatasetConfig(path="../../dataset", clean_data=True)
dataset = ParquetDataset(config)

sample = dataset[0]["signal"][["P-PDG", "T-TPT", "P-TPT"]].copy()
sample.head(10)

[ParquetDataset] Dataset found at ../../dataset
[ParquetDataset] Validating dataset integrity...
[ParquetDataset] Dataset integrity check passed!


| timestamp | P-PDG | T-TPT | P-TPT |
|---|---|---|---|
| 2017-02-01 01:02:07 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:08 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:09 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:10 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:11 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:12 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:13 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:14 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:15 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:16 | 0.0 | 0.726246 | -0.663171 |


Let's simulate some missing data by blanking out every third row:

In [2]:
import pandas as pd

sample_with_missing = sample.copy() # Let's copy a slice of the dataset
sample_with_missing.iloc[::3] = pd.NA # simulate missing

sample_with_missing.head(10)

| timestamp | P-PDG | T-TPT | P-TPT |
|---|---|---|---|
| 2017-02-01 01:02:07 | NaN | NaN | NaN |
| 2017-02-01 01:02:08 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:09 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:10 | NaN | NaN | NaN |
| 2017-02-01 01:02:11 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:12 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:13 | NaN | NaN | NaN |
| 2017-02-01 01:02:14 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:15 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:16 | NaN | NaN | NaN |
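Before imputing, it can help to check how much data is actually missing. The sketch below uses plain pandas on a small synthetic frame (`demo` is a hypothetical stand-in for `sample`, so the cell runs without the dataset) and counts the missing entries per column:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `sample`: nine one-second readings per column.
index = pd.date_range("2017-02-01 01:02:07", periods=9, freq="s")
demo = pd.DataFrame(
    {"P-PDG": 0.0, "T-TPT": 0.726246, "P-TPT": -0.663171},
    index=index,
)

demo_with_missing = demo.copy()
demo_with_missing.iloc[::3] = np.nan  # blank out every third row

# Number and fraction of missing entries in each column
missing_per_column = demo_with_missing.isna().sum()
missing_fraction = demo_with_missing.isna().mean()

print(missing_per_column)
print(missing_fraction)
```

The same two calls, `isna().sum()` and `isna().mean()`, can be applied directly to `sample_with_missing`.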


We can impute a dataframe with three different strategies:
- `"mean"` → replaces NaNs with the column mean  
- `"median"` → replaces NaNs with the column median  
- `"constant"` → replaces NaNs with a constant value provided by the user  

We may also restrict the imputation to selected `columns`; if none are provided, all numeric columns are imputed.

In [3]:
from ThreeWToolkit.core.base_preprocessing import ImputeMissingConfig
from ThreeWToolkit.preprocessing import ImputeMissing

impute_missing = ImputeMissing(ImputeMissingConfig(
    strategy="mean",
    #columns=["P-PDG"], # uncomment if you want to select only a few
))

# run it
df_pre = impute_missing.pre_process(sample_with_missing)
df_imputed = impute_missing.run(df_pre)
imputed_df = impute_missing.post_process(df_imputed)

imputed_df.head(10)

| timestamp | P-PDG | T-TPT | P-TPT |
|---|---|---|---|
| 2017-02-01 01:02:07 | 0.0 | 0.725055 | -0.665799 |
| 2017-02-01 01:02:08 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:09 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:10 | 0.0 | 0.725055 | -0.665799 |
| 2017-02-01 01:02:11 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:12 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:13 | 0.0 | 0.725055 | -0.665799 |
| 2017-02-01 01:02:14 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:15 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:16 | 0.0 | 0.725055 | -0.665799 |
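Every step in this notebook is driven through the same three calls: `pre_process`, `run`, and `post_process`. A small helper can wrap that pattern. This is only a sketch: `apply_step` and `DoubleValues` are hypothetical names that are not part of the toolkit, and the helper assumes nothing beyond a step object exposing those three methods.

```python
def apply_step(step, data):
    """Run one preprocessing step through its three stages, in order."""
    prepared = step.pre_process(data)
    result = step.run(prepared)
    return step.post_process(result)


class DoubleValues:
    """Hypothetical stand-in step, used only to exercise the helper."""

    def pre_process(self, data):
        return list(data)

    def run(self, data):
        return [2 * x for x in data]

    def post_process(self, data):
        return tuple(data)


doubled = apply_step(DoubleValues(), [1, 2, 3])
print(doubled)
```

With the toolkit objects from the cells in this notebook, a call such as `apply_step(impute_missing, sample_with_missing)` would stand in for the three separate lines.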


In [4]:
impute_missing = ImputeMissing(
    ImputeMissingConfig(strategy="median")
)

df_pre = impute_missing.pre_process(sample_with_missing)
df_imputed = impute_missing.run(df_pre)
imputed_df = impute_missing.post_process(df_imputed)

imputed_df.head(10)

| timestamp | P-PDG | T-TPT | P-TPT |
|---|---|---|---|
| 2017-02-01 01:02:07 | 0.0 | 0.725264 | -0.665395 |
| 2017-02-01 01:02:08 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:09 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:10 | 0.0 | 0.725264 | -0.665395 |
| 2017-02-01 01:02:11 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:12 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:13 | 0.0 | 0.725264 | -0.665395 |
| 2017-02-01 01:02:14 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:15 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:16 | 0.0 | 0.725264 | -0.665395 |


In [5]:
impute_missing = ImputeMissing(
    ImputeMissingConfig(strategy="constant", fill_value=0.0)
)

df_pre = impute_missing.pre_process(sample_with_missing)
df_imputed = impute_missing.run(df_pre)
imputed_df = impute_missing.post_process(df_imputed)

imputed_df.head(10)

| timestamp | P-PDG | T-TPT | P-TPT |
|---|---|---|---|
| 2017-02-01 01:02:07 | 0.0 | 0.0 | 0.0 |
| 2017-02-01 01:02:08 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:09 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:10 | 0.0 | 0.0 | 0.0 |
| 2017-02-01 01:02:11 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:12 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:13 | 0.0 | 0.0 | 0.0 |
| 2017-02-01 01:02:14 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:15 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:16 | 0.0 | 0.0 | 0.0 |
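To make the three strategies concrete, here is a minimal plain-pandas sketch of what each one computes. It is an illustration under stated assumptions, not the toolkit's implementation: the `impute` function and the `demo` frame are hypothetical, and the statistics are taken over the non-missing values of each column (the pandas default).

```python
import numpy as np
import pandas as pd


def impute(df, strategy="mean", fill_value=None, columns=None):
    """Fill NaNs column by column using the mean, the median, or a constant."""
    if columns is not None:
        cols = list(columns)
    else:
        cols = list(df.select_dtypes("number").columns)  # all numeric columns
    out = df.copy()
    if strategy == "mean":
        fill = out[cols].mean()      # per-column mean, NaNs skipped
    elif strategy == "median":
        fill = out[cols].median()    # per-column median, NaNs skipped
    elif strategy == "constant":
        if fill_value is None:
            raise ValueError("fill_value is required for the 'constant' strategy")
        fill = fill_value
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    out[cols] = out[cols].fillna(fill)
    return out


demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 8.0],
    "b": [np.nan, 2.0, 2.0, 5.0],
})

print(impute(demo, "mean"))
print(impute(demo, "median"))
print(impute(demo, "constant", fill_value=0.0))
```

Because the mean and median are computed over the whole column, an imputed value can differ slightly from its neighbours, as the `T-TPT` and `P-TPT` columns in the outputs above show.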


### Normalization

We may normalize columns or rows using three different strategies:
  - `"l1"` → normalization using the L1 norm (sum of absolute values = 1)  
  - `"l2"` → normalization using the L2 norm (Euclidean norm = 1)  
  - `"max"` → normalization by dividing by the maximum absolute value

Setting `axis=0` normalizes each column, and `axis=1` normalizes each row.

The normalization routine also accepts a flag, `return_norm_values` (default = `False`), that additionally returns the computed norms.  

In [6]:
from ThreeWToolkit.core.base_preprocessing import NormalizeConfig
from ThreeWToolkit.preprocessing import Normalize


normalize_step = Normalize(NormalizeConfig(norm="max", axis=0, return_norm_values=True))

df_pre = normalize_step.pre_process(sample)
normed_array = normalize_step.run(df_pre)
normed, norms = normalize_step.post_process(normed_array)

normed.tail()

| timestamp | P-PDG | T-TPT | P-TPT |
|---|---|---|---|
| 2017-02-01 06:59:56 | 0.0 | 0.997612 | -0.994992 |
| 2017-02-01 06:59:57 | 0.0 | 0.997607 | -0.994992 |
| 2017-02-01 06:59:58 | 0.0 | 0.997603 | -0.994992 |
| 2017-02-01 06:59:59 | 0.0 | 0.997598 | -0.994992 |
| 2017-02-01 07:00:00 | 0.0 | 0.997594 | -0.994992 |
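For reference, the three norms can be written out with NumPy. The sketch below normalizes each column (the `axis=0` case) and returns the norms as well. The `normalize_columns` function and the `demo` frame are hypothetical, and leaving an all-zero column unchanged is an assumption of this sketch, not a documented toolkit behaviour.

```python
import numpy as np
import pandas as pd


def normalize_columns(df, norm="max"):
    """Scale each column by its L1, L2, or max-abs norm; also return the norms."""
    values = df.to_numpy(dtype=float)
    if norm == "l1":
        norms = np.abs(values).sum(axis=0)
    elif norm == "l2":
        norms = np.sqrt((values ** 2).sum(axis=0))
    elif norm == "max":
        norms = np.abs(values).max(axis=0)
    else:
        raise ValueError(f"unknown norm: {norm!r}")
    # Assumption: an all-zero column is left as-is instead of dividing by zero.
    safe_norms = np.where(norms == 0.0, 1.0, norms)
    return df / safe_norms, pd.Series(norms, index=df.columns)


demo = pd.DataFrame({"a": [3.0, -4.0, 0.0], "b": [0.0, 0.0, 0.0]})
scaled, norms = normalize_columns(demo, norm="l2")

print(scaled)
print(norms)
```

Note that the `"max"` norm divides by the largest absolute value, so negative readings keep their sign, as the `P-TPT` column above illustrates.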


### Windowing

We can also segment a 1D time series into windows, optionally overlapping, and multiply each one by a window function.  

The windowing tools accept:  
- A Pandas **Series** (`X`) representing the input 1D signal  
- A window function (`window`, default = `"hann"`):  
  - A string for standard windows (e.g., `"hann"`, `"hamming"`)  
  - A tuple for parametrized windows (e.g., `("kaiser", beta)`)  
- A window size (`window_size`, default = 4): number of samples per window  
- An overlap ratio (`overlap`, default = 0.0): must be in the range `[0, 1)`  
- A flag (`normalize`, default = False) to normalize the window function to unit area  
- A flag (`fftbins`, default = True) to generate periodic windows suited to FFT analysis; set it to False for symmetric windows  
- A flag (`pad_last_window`, default = False) to pad the last incomplete window with constant values  
- A padding value (`pad_value`, default = 0.0) used when `pad_last_window=True`

The output is a **DataFrame** in which each row holds one window of the signal, already multiplied by the window function; a trailing `win` column holds a running window number.

Let's try it!

In [7]:
from ThreeWToolkit.core.base_preprocessing import WindowingConfig
from ThreeWToolkit.preprocessing import Windowing


# Simple time series
series = sample["T-TPT"] # selecting a single column on a dataframe returns a Series

# Applying windowing without overlap
windowing_step = Windowing(WindowingConfig(window="hamming", window_size=64, overlap=0.0))


windowed = windowing_step.pre_process(series)
windowed = windowing_step.run(windowed)
windowed = windowing_step.post_process(windowed)

windowed

| | var1_t0 | var1_t1 | var1_t2 | var1_t3 | var1_t4 | var1_t5 | var1_t6 | var1_t7 | var1_t8 | var1_t9 | ... | var1_t55 | var1_t56 | var1_t57 | var1_t58 | var1_t59 | var1_t60 | var1_t61 | var1_t62 | var1_t63 | win |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.058100 | 0.059708 | 0.064519 | 0.072485 | 0.083530 | 0.097547 | 0.114401 | 0.133931 | 0.155948 | 0.180239 | ... | 0.180241 | 0.155949 | 0.133932 | 0.114402 | 0.097548 | 0.083530 | 0.072485 | 0.064519 | 0.059709 | 1 |
| 1 | 0.058100 | 0.059709 | 0.064520 | 0.072486 | 0.083531 | 0.097548 | 0.114403 | 0.133933 | 0.155950 | 0.180242 | ... | 0.180243 | 0.155951 | 0.133934 | 0.114404 | 0.097549 | 0.083531 | 0.072486 | 0.064520 | 0.059710 | 2 |
| 2 | 0.058101 | 0.059710 | 0.064520 | 0.072487 | 0.083532 | 0.097549 | 0.114404 | 0.133934 | 0.155952 | 0.180244 | ... | 0.180246 | 0.155953 | 0.133936 | 0.114405 | 0.097550 | 0.083533 | 0.072487 | 0.064521 | 0.059711 | 3 |
| 3 | 0.058102 | 0.059711 | 0.064521 | 0.072487 | 0.083533 | 0.097550 | 0.114406 | 0.133936 | 0.155954 | 0.180246 | ... | 0.180248 | 0.155955 | 0.133937 | 0.114407 | 0.097551 | 0.083534 | 0.072488 | 0.064522 | 0.059711 | 4 |
| 4 | 0.058103 | 0.059711 | 0.064522 | 0.072488 | 0.083534 | 0.097551 | 0.114407 | 0.133937 | 0.155956 | 0.180249 | ... | 0.180251 | 0.155957 | 0.133939 | 0.114408 | 0.097553 | 0.083535 | 0.072489 | 0.064523 | 0.059712 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 330 | 0.057846 | 0.059447 | 0.064237 | 0.072168 | 0.083164 | 0.097120 | 0.113901 | 0.133345 | 0.155265 | 0.179451 | ... | 0.179451 | 0.155265 | 0.133345 | 0.113901 | 0.097120 | 0.083164 | 0.072168 | 0.064237 | 0.059447 | 331 |
| 331 | 0.057846 | 0.059447 | 0.064237 | 0.072168 | 0.083164 | 0.097120 | 0.113901 | 0.133345 | 0.155265 | 0.179451 | ... | 0.179451 | 0.155265 | 0.133345 | 0.113901 | 0.097120 | 0.083164 | 0.072168 | 0.064237 | 0.059447 | 332 |
| 332 | 0.057846 | 0.059447 | 0.064237 | 0.072168 | 0.083164 | 0.097120 | 0.113901 | 0.133345 | 0.155265 | 0.179451 | ... | 0.179764 | 0.155545 | 0.133591 | 0.114117 | 0.097309 | 0.083330 | 0.072315 | 0.064371 | 0.059575 | 333 |
| 333 | 0.057973 | 0.059581 | 0.064384 | 0.072337 | 0.083363 | 0.097357 | 0.114185 | 0.133684 | 0.155668 | 0.179925 | ... | 0.180062 | 0.155793 | 0.133798 | 0.114287 | 0.097449 | 0.083445 | 0.072411 | 0.064453 | 0.059647 | 334 |


In [8]:
# Applying windowing with 50% overlap
windowing_step = Windowing(WindowingConfig(window="hamming", window_size=64, overlap=0.5))


windowed = windowing_step.pre_process(series)
windowed = windowing_step.run(windowed)
windowed = windowing_step.post_process(windowed)

windowed

| | var1_t0 | var1_t1 | var1_t2 | var1_t3 | var1_t4 | var1_t5 | var1_t6 | var1_t7 | var1_t8 | var1_t9 | ... | var1_t55 | var1_t56 | var1_t57 | var1_t58 | var1_t59 | var1_t60 | var1_t61 | var1_t62 | var1_t63 | win |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.058100 | 0.059708 | 0.064519 | 0.072485 | 0.083530 | 0.097547 | 0.114401 | 0.133931 | 0.155948 | 0.180239 | ... | 0.180241 | 0.155949 | 0.133932 | 0.114402 | 0.097548 | 0.083530 | 0.072485 | 0.064519 | 0.059709 | 1 |
| 1 | 0.058100 | 0.059709 | 0.064519 | 0.072485 | 0.083530 | 0.097547 | 0.114402 | 0.133931 | 0.155948 | 0.180240 | ... | 0.180242 | 0.155950 | 0.133933 | 0.114403 | 0.097548 | 0.083531 | 0.072486 | 0.064520 | 0.059709 | 2 |
| 2 | 0.058100 | 0.059709 | 0.064520 | 0.072486 | 0.083531 | 0.097548 | 0.114403 | 0.133933 | 0.155950 | 0.180242 | ... | 0.180243 | 0.155951 | 0.133934 | 0.114404 | 0.097549 | 0.083531 | 0.072486 | 0.064520 | 0.059710 | 3 |
| 3 | 0.058101 | 0.059709 | 0.064520 | 0.072486 | 0.083531 | 0.097548 | 0.114403 | 0.133933 | 0.155950 | 0.180242 | ... | 0.180244 | 0.155952 | 0.133935 | 0.114405 | 0.097550 | 0.083532 | 0.072487 | 0.064521 | 0.059710 | 4 |
| 4 | 0.058101 | 0.059710 | 0.064520 | 0.072487 | 0.083532 | 0.097549 | 0.114404 | 0.133934 | 0.155952 | 0.180244 | ... | 0.180246 | 0.155953 | 0.133936 | 0.114405 | 0.097550 | 0.083533 | 0.072487 | 0.064521 | 0.059711 | 5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 665 | 0.057880 | 0.059486 | 0.064281 | 0.072222 | 0.083230 | 0.097202 | 0.114003 | 0.133471 | 0.155421 | 0.179638 | ... | 0.180051 | 0.155792 | 0.133804 | 0.114299 | 0.097464 | 0.083458 | 0.072423 | 0.064463 | 0.059656 | 666 |
| 666 | 0.057973 | 0.059581 | 0.064384 | 0.072337 | 0.083363 | 0.097357 | 0.114185 | 0.133684 | 0.155668 | 0.179925 | ... | 0.180062 | 0.155793 | 0.133798 | 0.114287 | 0.097449 | 0.083445 | 0.072411 | 0.064453 | 0.059647 | 667 |
| 667 | 0.058049 | 0.059656 | 0.064462 | 0.072420 | 0.083455 | 0.097459 | 0.114298 | 0.133809 | 0.155805 | 0.180074 | ... | 0.180033 | 0.155769 | 0.133777 | 0.114269 | 0.097434 | 0.083432 | 0.072400 | 0.064443 | 0.059638 | 668 |
| 668 | 0.058040 | 0.059647 | 0.064452 | 0.072409 | 0.083442 | 0.097444 | 0.114280 | 0.133789 | 0.155781 | 0.180046 | ... | 0.180006 | 0.155745 | 0.133756 | 0.114252 | 0.097419 | 0.083420 | 0.072389 | 0.064433 | 0.059629 | 669 |


Note that with 50% overlap the output dataframe has roughly twice as many rows (669 instead of 334), because consecutive windows start half a window apart instead of a full window apart.
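The segmentation itself is easy to sketch with NumPy. The example below cuts a signal into full windows, with a hop derived from the overlap ratio, and multiplies each window by a periodic Hamming window. The `periodic_hamming` and `sliding_windows` functions are hypothetical names, and this sketch keeps only full windows, so its handling of the end of the signal (and therefore the exact row count) may differ from the toolkit's.

```python
import numpy as np


def periodic_hamming(n):
    """Periodic Hamming window of length n (its first sample is 0.08)."""
    k = np.arange(n)
    return 0.54 - 0.46 * np.cos(2.0 * np.pi * k / n)


def sliding_windows(signal, window_size=4, overlap=0.0):
    """Cut a 1D signal into full windows and taper each one."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    signal = np.asarray(signal, dtype=float)
    step = max(1, int(window_size * (1.0 - overlap)))  # hop between window starts
    starts = range(0, len(signal) - window_size + 1, step)
    frames = np.array([signal[s:s + window_size] for s in starts])
    frames = frames.reshape(-1, window_size)  # stays 2D even when no window fits
    return frames * periodic_hamming(window_size)


signal = np.ones(10)
no_overlap = sliding_windows(signal, window_size=4, overlap=0.0)    # starts at 0, 4
half_overlap = sliding_windows(signal, window_size=4, overlap=0.5)  # starts at 0, 2, 4, 6

print(no_overlap.shape, half_overlap.shape)
```

Halving the hop doubles the number of window starts, which is why the overlapping run above returns about twice as many rows.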

### Column renaming

Last but not least, we may want to rename some columns in our pipeline so that they better reflect their meaning as we transform our data.

To rename columns, all we need is a dictionary (`columns_map`) with the old names as keys and the new names as values.

In [9]:
from ThreeWToolkit.core.base_preprocessing import RenameColumnsConfig
from ThreeWToolkit.preprocessing import RenameColumns

renaming = {
    "P-PDG": "sensor_A",
    "P-TPT": "sensor_B"
}
rename_step = RenameColumns(RenameColumnsConfig(columns_map=renaming))

renamed = rename_step.pre_process(sample)
renamed = rename_step.run(renamed)
renamed = rename_step.post_process(renamed)


renamed.head(10)

| timestamp | sensor_A | T-TPT | sensor_B |
|---|---|---|---|
| 2017-02-01 01:02:07 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:08 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:09 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:10 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:11 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:12 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:13 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:14 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:15 | 0.0 | 0.726246 | -0.663171 |
| 2017-02-01 01:02:16 | 0.0 | 0.726246 | -0.663171 |
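For comparison, the same mapping can be applied with plain pandas. The sketch below also checks that every key of the mapping exists before renaming, since `DataFrame.rename` silently ignores unknown names by default. The `demo` frame is a hypothetical one-row stand-in for `sample`, and whether the toolkit performs a similar check is not covered here.

```python
import pandas as pd

# Hypothetical one-row stand-in for `sample`
demo = pd.DataFrame({"P-PDG": [0.0], "T-TPT": [0.726246], "P-TPT": [-0.663171]})
columns_map = {"P-PDG": "sensor_A", "P-TPT": "sensor_B"}

# Fail early on typos: pandas would otherwise skip unknown column names.
unknown = set(columns_map) - set(demo.columns)
if unknown:
    raise KeyError(f"columns not found: {sorted(unknown)}")

renamed_demo = demo.rename(columns=columns_map)  # returns a new frame
print(list(renamed_demo.columns))
```

Columns that are not in the mapping, such as `T-TPT`, keep their original names.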


## Next Steps

🎉 **Nice!** Now you can use the **3W Toolkit** preprocessing tools!

### What's Next?

1. **Feature Extraction**: Discover advanced features in [Notebook 5: Feature Extraction](5_feature_extraction.ipynb)
2. **Visualization**: Discover how to visualize processed data in [Notebook 6: Data Visualization](6_data_visualization.ipynb)
---

**📚 Tutorial Navigation:**
- **Previous**: [3. Dataset Download](3_download_dataset.ipynb)
- **Next**: [5. Feature Extraction](5_feature_extraction.ipynb)

**🔗 Additional Resources:**
- [3W Project Repository](https://github.com/petrobras/3W)
- [3W Dataset on Figshare](https://figshare.com/projects/3W_Dataset/251195)
- [Workshop Registration](https://forms.gle/cmLa2u4VaXd1T7qp8)
