# Notebook 5: Get Preprocessed Data (All Datasets)

## Overview

This notebook applies the **complete preprocessing pipeline to ALL bundled FTIR datasets** at once, producing publication-ready preprocessed spectra for the entire Xpectrass dataset collection. It also computes spectral derivatives and combines datasets into unified files.

### What You'll Learn

1. How to batch-process multiple FTIR datasets efficiently
2. How to apply consistent preprocessing across different studies
3. How to compute spectral derivatives (1st and 2nd order)
4. How to combine multiple datasets into a unified format
5. How to save processed data for downstream analysis

### Prerequisites

Before running this notebook, you should have completed Notebooks 1-3 to determine:
- ✓ **Best denoising method** (Notebook 1)
- ✓ **Best baseline correction method** (Notebook 2)
- ✓ **Best normalization method** (Notebook 3)

### Bundled Datasets

Xpectrass includes 6 published FTIR datasets:
- **jung_2018**: Jung et al. (2018) microplastics study
- **kedzierski_2019**: Kedzierski et al. (2019) environmental samples
- **kedzierski_2019_u**: Kedzierski et al. (2019) unknown samples
- **frond_2021**: Frond et al. (2021) polymer characterization
- **villegas_camacho_2024_c4**: Villegas-Camacho et al. (2024) 4 cm⁻¹ resolution
- **villegas_camacho_2024_c8**: Villegas-Camacho et al. (2024) 8 cm⁻¹ resolution

### Preprocessing Pipeline

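Each raw dataset goes through the same pipeline (order matters!). The two Kedzierski datasets are already preprocessed absorbance spectra, so only step 6 is applied to them: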

1. **Convert** to absorbance
2. **Denoise** to remove random noise
3. **Correct baseline** to remove drift
4. **Handle atmospheric interference** (CO₂, H₂O)
5. **Normalize** to make spectra comparable
6. **Compute derivatives** (optional, for enhanced spectral features)
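
As a rough illustration of the first and fifth steps, the sketch below converts percent transmittance to absorbance and applies standard normal variate (SNV) scaling with plain NumPy. The function names are hypothetical, and the 0.01% clipping floor is borrowed from the warning Xpectrass prints for `frond_2021`. This is not the Xpectrass implementation, and the `snv_detrend` method selected later presumably adds a detrending step that is not shown here.

```python
import numpy as np

def transmittance_to_absorbance(percent_t, floor=0.01):
    """Convert percent transmittance to absorbance: A = 2 - log10(%T).

    Readings below `floor` (including zero and negative values) are
    clipped to `floor` before taking the logarithm.
    """
    percent_t = np.clip(np.asarray(percent_t, dtype=float), floor, None)
    return 2.0 - np.log10(percent_t)

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) separately."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

# Two toy spectra in %T; the 0.0 reading is invalid and gets clipped
percent_t = np.array([[100.0, 10.0, 1.0, 0.0],
                      [50.0, 25.0, 12.5, 100.0]])
absorbance = transmittance_to_absorbance(percent_t)
scaled = snv(absorbance)

print(absorbance[0])
print(scaled.mean(axis=1))
```

After SNV every spectrum has zero mean and unit standard deviation, which is what makes spectra from different instruments comparable in the later steps.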

### Spectral Derivatives

**Why compute derivatives?**
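- **1st derivative**: Removes constant baseline offsets and highlights where the signal changes most rapidly, which helps resolve subtle peaks and shoulders
- **2nd derivative**: Also removes linear baseline slopes and sharpens peaks further, which is useful for separating overlapping bands
- Derivatives are widely used in chemometric analysis and machine learning
- They can improve classification performance in some cases, but they also amplify noise, which is why a smoothing Savitzky-Golay filter is used to compute them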

**Parameters:**
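- `order`: 1 for first derivative, 2 for second derivative
- `window_length`: Savitzky-Golay filter window size in data points (15 is typical); must be larger than `polyorder`
- `polyorder`: Polynomial order for the S-G fit (3 is typical)
- `delta`: Spacing between data points (1.0 treats the spectrum as unit-spaced, so derivatives are expressed per data point rather than per cm⁻¹)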
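
To make these parameters concrete, here is a minimal sketch using SciPy's `savgol_filter`, whose `deriv` argument plays the role of `order`. Whether `derivatives()` calls SciPy internally is an assumption; the example only shows how the window, the polynomial order, and `delta` interact on a signal whose derivatives are known exactly.

```python
import numpy as np
from scipy.signal import savgol_filter

# Wavenumber axis with 2 cm^-1 spacing and a quadratic test signal y = x^2
x = np.arange(680.0, 780.0, 2.0)
y = x ** 2

# delta=1.0: first derivative per data point (index units)
d1_per_point = savgol_filter(y, window_length=15, polyorder=3, deriv=1, delta=1.0)

# delta=2.0: first derivative per cm^-1, because the true spacing is 2 cm^-1
d1_per_wn = savgol_filter(y, window_length=15, polyorder=3, deriv=1, delta=2.0)

# Second derivative per cm^-1 (a constant 2 for y = x^2)
d2_per_wn = savgol_filter(y, window_length=15, polyorder=3, deriv=2, delta=2.0)

print(np.allclose(d1_per_wn, 2 * x))
print(np.allclose(d1_per_point, 2 * d1_per_wn))
print(np.allclose(d2_per_wn, 2.0))
```

With `delta=1.0` the result is scaled by the sample spacing, so derivative magnitudes from datasets with different native spacings are not directly comparable unless the spectra share a grid.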

### Expected Output

By the end of this notebook, you'll have:
- **6 preprocessed datasets** (one for each study)
- **6 first derivative datasets**
- **6 second derivative datasets**
- **3 combined files** (normalized, 1st derivative, 2nd derivative)
- **2 derivative files computed from the combined normalized data** (1st and 2nd derivative)
- All files saved in compressed CSV format (`.csv.xz`) for analysis

---

## Step 1: Load All Datasets and Define Parameters

First, we'll load all bundled datasets and configure the preprocessing parameters based on your results from Notebooks 1-3.

In [1]:
# Import required modules
import polars as pl
from xpectrass import FTIRdataprocessing
from xpectrass import load_all_datasets, get_data_info

# Load all bundled datasets at once
# This returns a dictionary with dataset names as keys
dataset = load_all_datasets()
info = get_data_info()

# ============================================================================
# CONFIGURATION: Update these parameters based on your Notebooks 1-3 results
# ============================================================================

# Label column (contains polymer type information)
LABEL_COLUMN = "type"

# Flat windows for baseline correction evaluation
FLAT_WINDOWS = [(1880, 1900), (2400, 2700)]

# SELECTED METHOD FROM NOTEBOOK 1
DENOISING_METHOD = 'wavelet'  # Options: 'savgol', 'wavelet', 'gaussian', 'median', etc.

# SELECTED METHOD FROM NOTEBOOK 2
BASELINE_CORRECTION_METHOD = 'aspls'  # Options: 'asls', 'airpls', 'mor', 'snip', etc.

# Atmospheric correction settings
# Define regions to exclude (completely remove from analysis)
EXCLUDE_REGIONS = [
    (0, 679),       # Exclude everything below 680, CO₂ bending mode (670 cm⁻¹)
    (3001, 5000)    # Exclude everything above 3000, O–H stretch region
]

# Define regions to interpolate (replace with baseline)
INTERPOLATE_REGIONS = [
    (1250, 2700)    # Interpolate over H₂O bend + CO₂ stretch regions
]

# Interpolation method for atmospheric regions
INTERPOLATE_METHOD = "zero"  # Options: 'zero', 'linear', 'spline'

# SELECTED METHOD FROM NOTEBOOK 3
NORMALIZATION_METHOD = "snv_detrend"  # Options: 'snv', 'vector', 'minmax', 'area', etc.

# ============================================================================

print("="*80)
print("BATCH PREPROCESSING CONFIGURATION")
print("="*80)
print(f"\nSelected Methods:")
print(f"  1. Denoising:          {DENOISING_METHOD}")
print(f"  2. Baseline:           {BASELINE_CORRECTION_METHOD}")
print(f"  3. Atmospheric:        Interpolate method = {INTERPOLATE_METHOD}")
print(f"  4. Normalization:      {NORMALIZATION_METHOD}")
print(f"\nRegion Settings:")
print(f"  Exclude:     {EXCLUDE_REGIONS}")
print(f"  Interpolate: {INTERPOLATE_REGIONS}")
print(f"  Flat windows: {FLAT_WINDOWS}")
print("="*80 + "\n")

print("Dataset Information:")
print(info)

BATCH PREPROCESSING CONFIGURATION

Selected Methods:
  1. Denoising:          wavelet
  2. Baseline:           aspls
  3. Atmospheric:        Interpolate method = zero
  4. Normalization:      snv_detrend

Region Settings:
  Exclude:     [(0, 679), (3001, 5000)]
  Interpolate: [(1250, 2700)]
  Flat windows: [(1880, 1900), (2400, 2700)]

Dataset Information:
{'jung_2018': {'exists': True, 'path': '/Users/julhashkazi/Documents/PythonScripts/FTIR/scripts/notebooks/.conda/lib/python3.11/site-packages/xpectrass/data/jung_2018.csv.xz', 'filename': 'jung_2018.csv.xz', 'size_mb': 1.5899505615234375}, 'kedzierski_2019': {'exists': True, 'path': '/Users/julhashkazi/Documents/PythonScripts/FTIR/scripts/notebooks/.conda/lib/python3.11/site-packages/xpectrass/data/kedzierski_2019.csv.xz', 'filename': 'kedzierski_2019.csv.xz', 'size_mb': 7.486408233642578}, 'kedzierski_2019_u': {'exists': True, 'path': '/Users/julhashkazi/Documents/PythonScripts/FTIR/scripts/notebooks/.conda/lib/python3.11/site-pack
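
To show what the two region settings mean in practice, the following sketch applies them to a toy spectrum with NumPy. The helper names are hypothetical, and two details are assumptions: the interpolated region is bridged linearly (the configuration above uses `'zero'`, whose exact behaviour is not shown here), and interpolation is applied before exclusion. It is not the Xpectrass implementation.

```python
import numpy as np

def region_mask(wavenumbers, regions):
    """True where a wavenumber lies inside any inclusive (low, high) region."""
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    mask = np.zeros(wavenumbers.shape, dtype=bool)
    for low, high in regions:
        mask |= (wavenumbers >= low) & (wavenumbers <= high)
    return mask

def apply_regions(wavenumbers, spectrum, exclude, interpolate):
    """Bridge `interpolate` regions linearly, then drop `exclude` regions.

    Assumes `wavenumbers` is sorted in ascending order (required by np.interp).
    """
    wn = np.asarray(wavenumbers, dtype=float)
    y = np.asarray(spectrum, dtype=float).copy()

    # Replace values inside the interpolate regions using the points outside them
    bridge = region_mask(wn, interpolate)
    y[bridge] = np.interp(wn[bridge], wn[~bridge], y[~bridge])

    # Remove the excluded regions entirely
    keep = ~region_mask(wn, exclude)
    return wn[keep], y[keep]

wn = np.arange(600.0, 3201.0, 100.0)   # 600, 700, ..., 3200
spectrum = wn.copy()                    # linear toy signal
spectrum[wn == 2300.0] += 500.0         # artificial spike inside the bridged region

new_wn, new_y = apply_regions(
    wn, spectrum,
    exclude=[(0, 679), (3001, 5000)],
    interpolate=[(1250, 2700)],
)
print(len(wn), len(new_wn))
```

Excluded regions shorten the spectrum, while interpolated regions keep their columns and only replace the values, which is why the column count of each dataset drops after preprocessing.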

---

## Step 2: Extract Individual Datasets

Now we'll extract each dataset from the dictionary for processing.

In [2]:
# Extract individual datasets from the dictionary
jung_2018 = dataset['jung_2018']
kedzierski_2019 = dataset['kedzierski_2019']
kedzierski_2019_u = dataset['kedzierski_2019_u']
frond_2021 = dataset['frond_2021']
villegas_camacho_2024_c4 = dataset['villegas_camacho_2024_c4']
villegas_camacho_2024_c8 = dataset['villegas_camacho_2024_c8']

print("="*80)
print("INDIVIDUAL DATASET SHAPES")
print("="*80)
print(f'jung_2018:                 {jung_2018.shape}')
print(f'kedzierski_2019:           {kedzierski_2019.shape}')
print(f'kedzierski_2019_u:         {kedzierski_2019_u.shape}')
print(f'frond_2021:                {frond_2021.shape}')
print(f'villegas_camacho_2024_c4:  {villegas_camacho_2024_c4.shape}')
print(f'villegas_camacho_2024_c8:  {villegas_camacho_2024_c8.shape}')
print("="*80)

INDIVIDUAL DATASET SHAPES
jung_2018:                 (800, 3556)
kedzierski_2019:           (970, 1767)
kedzierski_2019_u:         (4064, 1768)
frond_2021:                (380, 1874)
villegas_camacho_2024_c4:  (3000, 3741)
villegas_camacho_2024_c8:  (3000, 1874)


### Create a processed_data folder to store processed data

In [3]:
# Create processed_data folder if it doesn't exist
import os

processed_data_dir = 'processed_data'
if not os.path.exists(processed_data_dir):
    os.makedirs(processed_data_dir)
    print(f"✓ Created '{processed_data_dir}' folder")
else:
    print(f"✓ '{processed_data_dir}' folder already exists")

✓ Created 'processed_data' folder


---

## Step 3: Process Each Dataset

Now we'll apply the complete preprocessing pipeline to each dataset individually. For each dataset, we will:
1. Apply full preprocessing (denoising → baseline → atmospheric → normalization)
2. Compute 1st derivative spectra
3. Compute 2nd derivative spectra

This may take several minutes depending on dataset sizes.

### Dataset 1: jung_2018

In [4]:
print("\n[1/6] Processing jung_2018...")

# Initialize FTIRdataprocessing class for jung_2018
fdp1 = FTIRdataprocessing(
    df=jung_2018,
    label_column=LABEL_COLUMN,
    exclude_regions=EXCLUDE_REGIONS,
    interpolate_regions=INTERPOLATE_REGIONS,
    flat_windows=FLAT_WINDOWS
)

# Apply full preprocessing pipeline
print("  → Applying full preprocessing (denoise + baseline + atmospheric + normalize)...")
jung_2018_corr = fdp1._get_normalized_data(
    denoising_method=DENOISING_METHOD,
    baseline_correction_method=BASELINE_CORRECTION_METHOD,
    interpolate_method=INTERPOLATE_METHOD,
    normalization_method=NORMALIZATION_METHOD,
    plot=False,
)

# Compute 1st derivative
print("  → Computing 1st derivative spectra...")
jung_2018_deriv1 = fdp1.derivatives(
    data=jung_2018_corr,
    order=1,
    window_length=15,
    polyorder=3,
    delta=1.0,
    plot=False,
    save_plot=False,
    save_path=None,
)

# Compute 2nd derivative
print("  → Computing 2nd derivative spectra...")
jung_2018_deriv2 = fdp1.derivatives(
    data=jung_2018_corr,
    order=2,
    window_length=15,
    polyorder=3,
    delta=1.0,
    plot=False,
    save_plot=False,
    save_path=None,
)

print(f"✓ jung_2018 complete: {jung_2018_corr.shape}")


[1/6] Processing jung_2018...
  → Applying full preprocessing (denoise + baseline + atmospheric + normalize)...
Auto-detected: Transmittance → Converting to Absorbance


Denoising (wavelet): 100%|██████████| 800/800 [00:00<00:00, 11779.23it/s]
Baseline correction (aspls): 100%|██████████| 800/800 [00:18<00:00, 42.49it/s]
Processing Regions: 100%|██████████| 800/800 [00:00<00:00, 10708.19it/s]
Normalization (snv_detrend): 100%|██████████| 800/800 [00:00<00:00, 11272.40it/s]


  → Computing 1st derivative spectra...
Computing 1st derivative for 800 samples...
  → Computing 2nd derivative spectra...
Computing 2nd derivative for 800 samples...
✓ jung_2018 complete: (800, 2326)


### Dataset 2: kedzierski_2019

**Note**: This dataset is already in absorbance mode and preprocessed, so we only compute derivatives.

In [5]:
print("\n[2/6] Processing kedzierski_2019...")

# Initialize FTIRdataprocessing class for kedzierski_2019
fdp2 = FTIRdataprocessing(
    df=kedzierski_2019,
    label_column=LABEL_COLUMN,
    exclude_regions=EXCLUDE_REGIONS,
    interpolate_regions=INTERPOLATE_REGIONS,
    flat_windows=FLAT_WINDOWS
)

# This dataset is already preprocessed, so we just copy it
print("  → Dataset already preprocessed, using as-is...")
kedzierski_2019_corr = kedzierski_2019.copy()

# Compute 1st derivative
print("  → Computing 1st derivative spectra...")
kedzierski_2019_deriv1 = fdp2.derivatives(
    data=kedzierski_2019_corr,
    order=1,
    window_length=15,
    polyorder=3,
    delta=1.0,
    plot=False,
    save_plot=False,
    save_path=None,
)

# Compute 2nd derivative
print("  → Computing 2nd derivative spectra...")
kedzierski_2019_deriv2 = fdp2.derivatives(
    data=kedzierski_2019_corr,
    order=2,
    window_length=15,
    polyorder=3,
    delta=1.0,
    plot=False,
    save_plot=False,
    save_path=None,
)

print(f"✓ kedzierski_2019 complete: {kedzierski_2019_corr.shape}")


[2/6] Processing kedzierski_2019...
  → Dataset already preprocessed, using as-is...
  → Computing 1st derivative spectra...
Computing 1st derivative for 970 samples...
  → Computing 2nd derivative spectra...
Computing 2nd derivative for 970 samples...
✓ kedzierski_2019 complete: (970, 1767)


### Dataset 3: kedzierski_2019_u (Unknown samples)

**Note**: This dataset is also already preprocessed.

In [6]:
print("\n[3/6] Processing kedzierski_2019_u...")

# Initialize FTIRdataprocessing class for kedzierski_2019_u
fdp3 = FTIRdataprocessing(
    df=kedzierski_2019_u,
    label_column=LABEL_COLUMN,
    exclude_regions=EXCLUDE_REGIONS,
    interpolate_regions=INTERPOLATE_REGIONS,
    flat_windows=FLAT_WINDOWS
)

# This dataset is already preprocessed, so we just copy it
print("  → Dataset already preprocessed, using as-is...")
kedzierski_2019_u_corr = kedzierski_2019_u.copy()

# Compute 1st derivative
print("  → Computing 1st derivative spectra...")
kedzierski_2019_u_deriv1 = fdp3.derivatives(
    data=kedzierski_2019_u_corr,
    order=1,
    window_length=15,
    polyorder=3,
    delta=1.0,
    plot=False,
    save_plot=False,
    save_path=None,
)

# Compute 2nd derivative
print("  → Computing 2nd derivative spectra...")
kedzierski_2019_u_deriv2 = fdp3.derivatives(
    data=kedzierski_2019_u_corr,
    order=2,
    window_length=15,
    polyorder=3,
    delta=1.0,
    plot=False,
    save_plot=False,
    save_path=None,
)

print(f"✓ kedzierski_2019_u complete: {kedzierski_2019_u_corr.shape}")


[3/6] Processing kedzierski_2019_u...
  → Dataset already preprocessed, using as-is...
  → Computing 1st derivative spectra...
Computing 1st derivative for 4064 samples...
  → Computing 2nd derivative spectra...
Computing 2nd derivative for 4064 samples...
✓ kedzierski_2019_u complete: (4064, 1768)


### Dataset 4: frond_2021

In [7]:
print("\n[4/6] Processing frond_2021...")

# Initialize FTIRdataprocessing class for frond_2021
fdp4 = FTIRdataprocessing(
    df=frond_2021,
    label_column=LABEL_COLUMN,
    exclude_regions=EXCLUDE_REGIONS,
    interpolate_regions=INTERPOLATE_REGIONS,
    flat_windows=FLAT_WINDOWS
)

# Apply full preprocessing pipeline
print("  → Applying full preprocessing (denoise + baseline + atmospheric + normalize)...")
frond_2021_corr = fdp4._get_normalized_data(
    denoising_method=DENOISING_METHOD,
    baseline_correction_method=BASELINE_CORRECTION_METHOD,
    interpolate_method=INTERPOLATE_METHOD,
    normalization_method=NORMALIZATION_METHOD,
    plot=False,
)

# Compute 1st derivative
print("  → Computing 1st derivative spectra...")
frond_2021_deriv1 = fdp4.derivatives(
    data=frond_2021_corr,
    order=1,
    window_length=15,
    polyorder=3,
    delta=1.0,
    plot=False,
    save_plot=False,
    save_path=None,
)

# Compute 2nd derivative
print("  → Computing 2nd derivative spectra...")
frond_2021_deriv2 = fdp4.derivatives(
    data=frond_2021_corr,
    order=2,
    window_length=15,
    polyorder=3,
    delta=1.0,
    plot=False,
    save_plot=False,
    save_path=None,
)

print(f"✓ frond_2021 complete: {frond_2021_corr.shape}")

Found 0 negative and 54720 zero transmittance values. These are physically invalid and will be clipped to 0.01% for conversion. This indicates data quality issues in the input.



[4/6] Processing frond_2021...
  → Applying full preprocessing (denoise + baseline + atmospheric + normalize)...
Auto-detected: Transmittance → Converting to Absorbance


Denoising (wavelet): 100%|██████████| 380/380 [00:00<00:00, 17757.03it/s]
Baseline correction (aspls): 100%|██████████| 380/380 [00:02<00:00, 153.22it/s]
Processing Regions: 100%|██████████| 380/380 [00:00<00:00, 16745.84it/s]
Normalization (snv_detrend): 100%|██████████| 380/380 [00:00<00:00, 15526.14it/s]

  → Computing 1st derivative spectra...
Computing 1st derivative for 380 samples...
  → Computing 2nd derivative spectra...
Computing 2nd derivative for 380 samples...
✓ frond_2021 complete: (380, 1209)





### Dataset 5: villegas_camacho_2024_c4

In [8]:
print("\n[5/6] Processing villegas_camacho_2024_c4...")

# Initialize FTIRdataprocessing class for villegas_camacho_2024_c4
fdp5 = FTIRdataprocessing(
    df=villegas_camacho_2024_c4,
    label_column=LABEL_COLUMN,
    exclude_regions=EXCLUDE_REGIONS,
    interpolate_regions=INTERPOLATE_REGIONS,
    flat_windows=FLAT_WINDOWS
)

# Apply full preprocessing pipeline
print("  → Applying full preprocessing (denoise + baseline + atmospheric + normalize)...")
villegas_camacho_2024_c4_corr = fdp5._get_normalized_data(
    denoising_method=DENOISING_METHOD,
    baseline_correction_method=BASELINE_CORRECTION_METHOD,
    interpolate_method=INTERPOLATE_METHOD,
    normalization_method=NORMALIZATION_METHOD,
    plot=False,
)

# Compute 1st derivative
print("  → Computing 1st derivative spectra...")
villegas_camacho_2024_c4_deriv1 = fdp5.derivatives(
    data=villegas_camacho_2024_c4_corr,
    order=1,
    window_length=15,
    polyorder=3,
    delta=1.0,
    plot=False,
    save_plot=False,
    save_path=None,
)

# Compute 2nd derivative
print("  → Computing 2nd derivative spectra...")
villegas_camacho_2024_c4_deriv2 = fdp5.derivatives(
    data=villegas_camacho_2024_c4_corr,
    order=2,
    window_length=15,
    polyorder=3,
    delta=1.0,
    plot=False,
    save_plot=False,
    save_path=None,
)

print(f"✓ villegas_camacho_2024_c4 complete: {villegas_camacho_2024_c4_corr.shape}")


[5/6] Processing villegas_camacho_2024_c4...
  → Applying full preprocessing (denoise + baseline + atmospheric + normalize)...
Auto-detected: Transmittance → Converting to Absorbance


Denoising (wavelet): 100%|██████████| 3000/3000 [00:00<00:00, 11205.07it/s]
Baseline correction (aspls): 100%|██████████| 3000/3000 [01:19<00:00, 37.80it/s]
Processing Regions: 100%|██████████| 3000/3000 [00:00<00:00, 9109.60it/s]
Normalization (snv_detrend): 100%|██████████| 3000/3000 [00:00<00:00, 10857.37it/s]


  → Computing 1st derivative spectra...
Computing 1st derivative for 3000 samples...
  → Computing 2nd derivative spectra...
Computing 2nd derivative for 3000 samples...
✓ villegas_camacho_2024_c4 complete: (3000, 2413)


### Dataset 6: villegas_camacho_2024_c8

In [9]:
print("\n[6/6] Processing villegas_camacho_2024_c8...")

# Initialize FTIRdataprocessing class for villegas_camacho_2024_c8
fdp6 = FTIRdataprocessing(
    df=villegas_camacho_2024_c8,
    label_column=LABEL_COLUMN,
    exclude_regions=EXCLUDE_REGIONS,
    interpolate_regions=INTERPOLATE_REGIONS,
    flat_windows=FLAT_WINDOWS
)

# Apply full preprocessing pipeline
print("  → Applying full preprocessing (denoise + baseline + atmospheric + normalize)...")
villegas_camacho_2024_c8_corr = fdp6._get_normalized_data(
    denoising_method=DENOISING_METHOD,
    baseline_correction_method=BASELINE_CORRECTION_METHOD,
    interpolate_method=INTERPOLATE_METHOD,
    normalization_method=NORMALIZATION_METHOD,
    plot=False,
)

# Compute 1st derivative
print("  → Computing 1st derivative spectra...")
villegas_camacho_2024_c8_deriv1 = fdp6.derivatives(
    data=villegas_camacho_2024_c8_corr,
    order=1,
    window_length=15,
    polyorder=3,
    delta=1.0,
    plot=False,
    save_plot=False,
    save_path=None,
)

# Compute 2nd derivative
print("  → Computing 2nd derivative spectra...")
villegas_camacho_2024_c8_deriv2 = fdp6.derivatives(
    data=villegas_camacho_2024_c8_corr,
    order=2,
    window_length=15,
    polyorder=3,
    delta=1.0,
    plot=False,
    save_plot=False,
    save_path=None,
)

print(f"✓ villegas_camacho_2024_c8 complete: {villegas_camacho_2024_c8_corr.shape}")
print("\n" + "="*80)
print("ALL DATASETS PROCESSED SUCCESSFULLY")
print("="*80)


[6/6] Processing villegas_camacho_2024_c8...
  → Applying full preprocessing (denoise + baseline + atmospheric + normalize)...
Auto-detected: Transmittance → Converting to Absorbance


Denoising (wavelet): 100%|██████████| 3000/3000 [00:00<00:00, 17523.90it/s]
Baseline correction (aspls): 100%|██████████| 3000/3000 [00:44<00:00, 67.70it/s]
Processing Regions: 100%|██████████| 3000/3000 [00:00<00:00, 16536.77it/s]
Normalization (snv_detrend): 100%|██████████| 3000/3000 [00:00<00:00, 15719.82it/s]


  → Computing 1st derivative spectra...
Computing 1st derivative for 3000 samples...
  → Computing 2nd derivative spectra...
Computing 2nd derivative for 3000 samples...
✓ villegas_camacho_2024_c8 complete: (3000, 1209)

ALL DATASETS PROCESSED SUCCESSFULLY
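
The six cells above repeat the same three calls, so the pattern can also be written as one loop. The sketch below assumes nothing about Xpectrass: `process_all`, `toy_preprocess`, and `toy_derive` are hypothetical names, and the two callables are toy stand-ins so that the block runs on its own.

```python
import pandas as pd

def process_all(datasets, preprocess, derive, already_preprocessed=()):
    """Run one preprocessing/derivative routine over a dict of DataFrames.

    preprocess(df) and derive(df, order) are caller-supplied callables
    (hypothetical wrappers, not Xpectrass functions).
    """
    results = {}
    for i, (name, df) in enumerate(datasets.items(), start=1):
        print(f"[{i}/{len(datasets)}] Processing {name}...")
        # Already-preprocessed datasets are copied unchanged
        corr = df.copy() if name in already_preprocessed else preprocess(df)
        results[name] = {
            "corr": corr,
            "deriv1": derive(corr, 1),
            "deriv2": derive(corr, 2),
        }
    return results

def toy_preprocess(df):
    """Stand-in for the full preprocessing call."""
    return df * 10.0

def toy_derive(df, order):
    """Stand-in for the derivative call: fills the frame with `order`."""
    return df * 0.0 + order

toy = {
    "raw_study": pd.DataFrame({"1000": [1.0, 2.0], "1002": [3.0, 5.0]}),
    "clean_study": pd.DataFrame({"1000": [0.1, 0.2], "1002": [0.3, 0.5]}),
}
results = process_all(toy, toy_preprocess, toy_derive,
                      already_preprocessed=("clean_study",))
print(sorted(results))
```

In this notebook, `preprocess` would wrap the `FTIRdataprocessing(...)._get_normalized_data(...)` call and `derive` would wrap `.derivatives(...)`, using the same arguments as in the cells above.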


---

## Step 4: Clean Up Metadata Columns

Remove the 'study' column from preprocessed datasets before combining (it will be re-added during combination with proper study names).

In [10]:
print("\n" + "="*80)
print("CLEANING UP METADATA COLUMNS")
print("="*80)
print("Removing 'study' column from all datasets (will be re-added during combination)...\n")

# Remove 'study' column from all corrected datasets
jung_2018_corr.drop(columns=['study'], inplace=True)
kedzierski_2019_corr.drop(columns=['study'], inplace=True)
kedzierski_2019_u_corr.drop(columns=['study'], inplace=True)
frond_2021_corr.drop(columns=['study'], inplace=True)
villegas_camacho_2024_c4_corr.drop(columns=['study'], inplace=True)
villegas_camacho_2024_c8_corr.drop(columns=['study'], inplace=True)

print("✓ Cleanup complete")


CLEANING UP METADATA COLUMNS
Removing 'study' column from all datasets (will be re-added during combination)...

✓ Cleanup complete


In [11]:
print("\n" + "="*80)
print("PREPROCESSED DATA SHAPES (normalized)")
print("="*80)
print(f'jung_2018:                 {jung_2018_corr.shape}')
print(f'kedzierski_2019:           {kedzierski_2019_corr.shape}')
print(f'kedzierski_2019_u:         {kedzierski_2019_u_corr.shape}')
print(f'frond_2021:                {frond_2021_corr.shape}')
print(f'villegas_camacho_2024_c4:  {villegas_camacho_2024_c4_corr.shape}')
print(f'villegas_camacho_2024_c8:  {villegas_camacho_2024_c8_corr.shape}')
print("="*80)


PREPROCESSED DATA SHAPES (normalized)
jung_2018:                 (800, 2325)
kedzierski_2019:           (970, 1766)
kedzierski_2019_u:         (4064, 1767)
frond_2021:                (380, 1208)
villegas_camacho_2024_c4:  (3000, 2412)
villegas_camacho_2024_c8:  (3000, 1208)


In [12]:
print("\nRemoving 'study' column from 1st derivative datasets...\n")

# Remove 'study' column from all 1st derivative datasets
jung_2018_deriv1.drop(columns=['study'], inplace=True)
kedzierski_2019_deriv1.drop(columns=['study'], inplace=True)
kedzierski_2019_u_deriv1.drop(columns=['study'], inplace=True)
frond_2021_deriv1.drop(columns=['study'], inplace=True)
villegas_camacho_2024_c4_deriv1.drop(columns=['study'], inplace=True)
villegas_camacho_2024_c8_deriv1.drop(columns=['study'], inplace=True)

print("✓ Cleanup complete for 1st derivatives")


Removing 'study' column from 1st derivative datasets...

✓ Cleanup complete for 1st derivatives


In [13]:
print("\n" + "="*80)
print("1ST DERIVATIVE DATA SHAPES")
print("="*80)
print(f'jung_2018:                 {jung_2018_deriv1.shape}')
print(f'kedzierski_2019:           {kedzierski_2019_deriv1.shape}')
print(f'kedzierski_2019_u:         {kedzierski_2019_u_deriv1.shape}')
print(f'frond_2021:                {frond_2021_deriv1.shape}')
print(f'villegas_camacho_2024_c4:  {villegas_camacho_2024_c4_deriv1.shape}')
print(f'villegas_camacho_2024_c8:  {villegas_camacho_2024_c8_deriv1.shape}')
print("="*80)


1ST DERIVATIVE DATA SHAPES
jung_2018:                 (800, 2325)
kedzierski_2019:           (970, 1766)
kedzierski_2019_u:         (4064, 1767)
frond_2021:                (380, 1208)
villegas_camacho_2024_c4:  (3000, 2412)
villegas_camacho_2024_c8:  (3000, 1208)


In [14]:
print("\nRemoving 'study' column from 2nd derivative datasets...\n")

# Remove 'study' column from all 2nd derivative datasets
jung_2018_deriv2.drop(columns=['study'], inplace=True)
kedzierski_2019_deriv2.drop(columns=['study'], inplace=True)
kedzierski_2019_u_deriv2.drop(columns=['study'], inplace=True)
frond_2021_deriv2.drop(columns=['study'], inplace=True)
villegas_camacho_2024_c4_deriv2.drop(columns=['study'], inplace=True)
villegas_camacho_2024_c8_deriv2.drop(columns=['study'], inplace=True)

print("✓ Cleanup complete for 2nd derivatives")


Removing 'study' column from 2nd derivative datasets...

✓ Cleanup complete for 2nd derivatives


In [15]:
print("\n" + "="*80)
print("2ND DERIVATIVE DATA SHAPES")
print("="*80)
print(f'jung_2018:                 {jung_2018_deriv2.shape}')
print(f'kedzierski_2019:           {kedzierski_2019_deriv2.shape}')
print(f'kedzierski_2019_u:         {kedzierski_2019_u_deriv2.shape}')
print(f'frond_2021:                {frond_2021_deriv2.shape}')
print(f'villegas_camacho_2024_c4:  {villegas_camacho_2024_c4_deriv2.shape}')
print(f'villegas_camacho_2024_c8:  {villegas_camacho_2024_c8_deriv2.shape}')
print("="*80)


2ND DERIVATIVE DATA SHAPES
jung_2018:                 (800, 2325)
kedzierski_2019:           (970, 1766)
kedzierski_2019_u:         (4064, 1767)
frond_2021:                (380, 1208)
villegas_camacho_2024_c4:  (3000, 2412)
villegas_camacho_2024_c8:  (3000, 1208)
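
The three cleanup cells above raise a `KeyError` if they are run a second time, because the `'study'` column is already gone. A small helper built on pandas' `errors="ignore"` option makes the step safe to repeat; the helper name is hypothetical and the frames are toy data.

```python
import pandas as pd

def drop_columns_if_present(frames, columns=("study",)):
    """Drop `columns` in place from each DataFrame, ignoring missing ones."""
    for frame in frames:
        frame.drop(columns=list(columns), inplace=True, errors="ignore")

a = pd.DataFrame({"type": ["PE"], "study": ["x"], "1000": [0.1]})
b = pd.DataFrame({"type": ["PP"], "1000": [0.2]})   # no 'study' column

drop_columns_if_present([a, b])
drop_columns_if_present([a, b])   # second call is a no-op
print(list(a.columns), list(b.columns))
```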


---

## Step 5: Combine All Datasets

Now we'll combine all individual datasets into unified files. The `combine_datasets()` function:
- Interpolates all spectra to a common wavenumber grid
- Ensures consistent resolution across all studies
- Adds study name column for tracking dataset origin
- Handles different wavenumber ranges automatically

### Parameters Explained:

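- **wn_min, wn_max**: Bounds of the common wavenumber range (680-3000 cm⁻¹)
- **resolution**: Spacing of the common wavenumber grid (2.0 cm⁻¹); this is the grid step, not the instrument resolution, so resampling does not add spectral detail to coarser datasets
- **descending**: Wavenumber order (True = high to low, typical for FTIR)
- **method**: Interpolation method ("pchip" = Piecewise Cubic Hermite Interpolating Polynomial, recommended for smooth data)
- **add_study_column**: Metadata columns to preserve (here `sample_id`, `environmental`, and `resolution`)
- **study_names**: Names for each dataset, listed in the same order as `datasets` (for tracking)
- **n_jobs**: Number of parallel jobs for faster processing (12 in this notebook; adjust to the CPU cores available)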
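
The resampling idea behind these parameters can be sketched with SciPy's `PchipInterpolator`. The function `resample_spectrum` is a hypothetical stand-in, not the `combine_datasets()` implementation, and the source spectrum is synthetic. With the values above the grid has 1161 points, which would be consistent with the 1166 columns reported in the output below if the remaining five columns hold the label and metadata.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator

def resample_spectrum(wn, y, wn_min=680.0, wn_max=3000.0,
                      resolution=2.0, descending=True):
    """Interpolate one spectrum onto a regular wavenumber grid with PCHIP."""
    wn = np.asarray(wn, dtype=float)
    y = np.asarray(y, dtype=float)

    # PchipInterpolator needs strictly increasing x, so sort first
    order = np.argsort(wn)
    interpolator = PchipInterpolator(wn[order], y[order])

    # Half a step is added so that wn_max itself is included in the grid
    grid = np.arange(wn_min, wn_max + resolution / 2, resolution)
    values = interpolator(grid)

    if descending:
        grid, values = grid[::-1], values[::-1]
    return grid, values

# Synthetic spectrum on an irregular-looking native axis (high to low)
wn_src = np.linspace(3000.7, 679.8, 1204)
y_src = 0.001 * wn_src + 1.0

grid, y_new = resample_spectrum(wn_src, y_src)
print(grid[0], grid[-1], grid.size)
```

PCHIP preserves monotonicity between neighbouring points, so it does not overshoot around sharp peaks the way an unconstrained cubic spline can.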

### Combine Normalized Data

In [16]:
print("\n" + "="*80)
print("COMBINING NORMALIZED DATASETS")
print("="*80)
print("This will interpolate all datasets to a common wavenumber grid...")
print("Parameters: wn_range=(680, 3000), resolution=2.0 cm⁻¹, method='pchip'\n")

from xpectrass import combine_datasets

# Combine all normalized datasets into a single file
combined_norm_data, _ = combine_datasets(
    datasets=[
        jung_2018_corr, 
        kedzierski_2019_corr, 
        kedzierski_2019_u_corr,
        frond_2021_corr, 
        villegas_camacho_2024_c4_corr,
        villegas_camacho_2024_c8_corr
    ],
    wn_min=680,
    wn_max=3000,
    resolution=2.0,
    descending=True,
    method="pchip",
    label_column="type",
    exclude_columns=None,
    add_study_column=['sample_id', 'environmental', 'resolution'],
    study_names=[
        'jung_2018', 
        'kedzierski_2019', 
        'kedzierski_2019_u',
        'frond_2021', 
        'villegas_camacho_2024_c4',
        'villegas_camacho_2024_c8'
    ],
    show_progress=True,
    n_jobs=12,
    data_mode="normalized"
)

# Save to compressed CSV
print("\nSaving combined normalized data...")
combined_norm_data.to_csv('processed_data/combined_norm_data.csv.xz', compression='xz', index=False)
print(f"✓ Saved: combined_norm_data.csv.xz (shape: {combined_norm_data.shape})")
print("="*80)




COMBINING NORMALIZED DATASETS
This will interpolate all datasets to a common wavenumber grid...
Parameters: wn_range=(680, 3000), resolution=2.0 cm⁻¹, method='pchip'


DATASET COVERAGE ANALYSIS
Target grid: 680.0 - 3000.0 cm⁻¹ (2320.0 cm⁻¹ range)
Grid mode: intersection
----------------------------------------------------------------------
  jung_2018: 800 samples, range 680.0-3000.0 cm⁻¹, coverage: ✓ FULL
  kedzierski_2019: 970 samples, range 599.8-3996.0 cm⁻¹, coverage: ✓ FULL
  kedzierski_2019_u: 4064 samples, range 599.8-3997.9 cm⁻¹, coverage: ✓ FULL
  frond_2021: 380 samples, range 680.8-3000.8 cm⁻¹, coverage: ✓ FULL
  villegas_camacho_2024_c4: 3000 samples, range 679.8-3000.7 cm⁻¹, coverage: ✓ FULL
  villegas_camacho_2024_c8: 3000 samples, range 680.7-3000.7 cm⁻¹, coverage: ✓ FULL
----------------------------------------------------------------------
Total: 12214 samples, 12214 with full coverage (100.0%)



Resampling (pchip): 100%|██████████| 800/800 [00:01<00:00, 481.16it/s]
Resampling (pchip): 100%|██████████| 970/970 [00:00<00:00, 21457.30it/s]
Resampling (pchip): 100%|██████████| 4064/4064 [00:00<00:00, 27812.74it/s]
Resampling (pchip): 100%|██████████| 380/380 [00:00<00:00, 14928.45it/s]
Resampling (pchip): 100%|██████████| 3000/3000 [00:00<00:00, 18601.42it/s]
Resampling (pchip): 100%|██████████| 3000/3000 [00:00<00:00, 33625.99it/s]



Saving combined normalized data...
✓ Saved: combined_norm_data.csv.xz (shape: (12214, 1166))
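
For downstream analysis, an xz-compressed CSV can be read back with `pandas.read_csv`, which infers the compression from the `.xz` suffix. The round trip below uses a toy frame and a temporary directory rather than the real output file.

```python
import os
import tempfile

import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "type": ["PE", "PP"],
    "3000.0": [0.5, 0.25],
    "2998.0": [1.5, 2.0],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.csv.xz")
    frame.to_csv(path, compression="xz", index=False)
    # compression defaults to "infer", so the .xz suffix is enough when reading
    restored = pd.read_csv(path)

print(restored.shape)
print(list(restored.columns))
```

Column names come back as strings, so wavenumber headers such as `"3000.0"` need to be converted with `float()` before they are used as a numeric axis.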


### Combine 1st Derivative Data

In [17]:
print("\n" + "="*80)
print("COMBINING 1ST DERIVATIVE DATASETS")
print("="*80)

# Combine all 1st derivative datasets
combined_deriv1_data, _ = combine_datasets(
    datasets=[
        jung_2018_deriv1, 
        kedzierski_2019_deriv1, 
        kedzierski_2019_u_deriv1,
        frond_2021_deriv1, 
        villegas_camacho_2024_c4_deriv1,
        villegas_camacho_2024_c8_deriv1
    ],
    wn_min=680,
    wn_max=3000,
    resolution=2.0,
    descending=True,
    method="pchip",
    label_column="type",
    exclude_columns=None,
    add_study_column=['sample_id', 'environmental', 'resolution'],
    study_names=[
        'jung_2018', 
        'kedzierski_2019', 
        'kedzierski_2019_u',
        'frond_2021', 
        'villegas_camacho_2024_c4',
        'villegas_camacho_2024_c8'
    ],
    show_progress=True,
    n_jobs=12,
    data_mode="normalized"
)

# Save to compressed CSV
print("\nSaving combined 1st derivative data...")
combined_deriv1_data.to_csv('processed_data/combined_deriv1_data.csv.xz', compression='xz', index=False)
print(f"✓ Saved: combined_deriv1_data.csv.xz (shape: {combined_deriv1_data.shape})")
print("="*80)




COMBINING 1ST DERIVATIVE DATASETS

DATASET COVERAGE ANALYSIS
Target grid: 680.0 - 3000.0 cm⁻¹ (2320.0 cm⁻¹ range)
Grid mode: intersection
----------------------------------------------------------------------
  jung_2018: 800 samples, range 680.0-3000.0 cm⁻¹, coverage: ✓ FULL
  kedzierski_2019: 970 samples, range 599.8-3996.0 cm⁻¹, coverage: ✓ FULL
  kedzierski_2019_u: 4064 samples, range 599.8-3997.9 cm⁻¹, coverage: ✓ FULL
  frond_2021: 380 samples, range 680.8-3000.8 cm⁻¹, coverage: ✓ FULL
  villegas_camacho_2024_c4: 3000 samples, range 679.8-3000.7 cm⁻¹, coverage: ✓ FULL
  villegas_camacho_2024_c8: 3000 samples, range 680.7-3000.7 cm⁻¹, coverage: ✓ FULL
----------------------------------------------------------------------
Total: 12214 samples, 12214 with full coverage (100.0%)



Resampling (pchip): 100%|██████████| 800/800 [00:00<00:00, 17547.37it/s]
Resampling (pchip): 100%|██████████| 970/970 [00:00<00:00, 23787.38it/s]
Resampling (pchip): 100%|██████████| 4064/4064 [00:00<00:00, 29885.67it/s]
Resampling (pchip): 100%|██████████| 380/380 [00:00<00:00, 14697.31it/s]
Resampling (pchip): 100%|██████████| 3000/3000 [00:00<00:00, 20276.28it/s]
Resampling (pchip): 100%|██████████| 3000/3000 [00:00<00:00, 40918.31it/s]



Saving combined 1st derivative data...
✓ Saved: combined_deriv1_data.csv.xz (shape: (12214, 1166))


### Combine 2nd Derivative Data

In [18]:
print("\n" + "="*80)
print("COMBINING 2ND DERIVATIVE DATASETS")
print("="*80)

# Combine all 2nd derivative datasets
combined_deriv2_data, _ = combine_datasets(
    datasets=[
        jung_2018_deriv2, 
        kedzierski_2019_deriv2, 
        kedzierski_2019_u_deriv2,
        frond_2021_deriv2, 
        villegas_camacho_2024_c4_deriv2,
        villegas_camacho_2024_c8_deriv2
    ],
    wn_min=680,
    wn_max=3000,
    resolution=2.0,
    descending=True,
    method="pchip",
    label_column="type",
    exclude_columns=None,
    add_study_column=['sample_id', 'environmental', 'resolution'],
    study_names=[
        'jung_2018', 
        'kedzierski_2019', 
        'kedzierski_2019_u',
        'frond_2021', 
        'villegas_camacho_2024_c4',
        'villegas_camacho_2024_c8'
    ],
    show_progress=True,
    n_jobs=12,
    data_mode="normalized"
)

# Save to compressed CSV
print("\nSaving combined 2nd derivative data...")
combined_deriv2_data.to_csv('processed_data/combined_deriv2_data.csv.xz', compression='xz', index=False)
print(f"✓ Saved: combined_deriv2_data.csv.xz (shape: {combined_deriv2_data.shape})")
print("="*80)




COMBINING 2ND DERIVATIVE DATASETS

DATASET COVERAGE ANALYSIS
Target grid: 680.0 - 3000.0 cm⁻¹ (2320.0 cm⁻¹ range)
Grid mode: intersection
----------------------------------------------------------------------
  jung_2018: 800 samples, range 680.0-3000.0 cm⁻¹, coverage: ✓ FULL
  kedzierski_2019: 970 samples, range 599.8-3996.0 cm⁻¹, coverage: ✓ FULL
  kedzierski_2019_u: 4064 samples, range 599.8-3997.9 cm⁻¹, coverage: ✓ FULL
  frond_2021: 380 samples, range 680.8-3000.8 cm⁻¹, coverage: ✓ FULL
  villegas_camacho_2024_c4: 3000 samples, range 679.8-3000.7 cm⁻¹, coverage: ✓ FULL
  villegas_camacho_2024_c8: 3000 samples, range 680.7-3000.7 cm⁻¹, coverage: ✓ FULL
----------------------------------------------------------------------
Total: 12214 samples, 12214 with full coverage (100.0%)



Resampling (pchip): 100%|██████████| 800/800 [00:00<00:00, 17387.52it/s]
Resampling (pchip): 100%|██████████| 970/970 [00:00<00:00, 19247.94it/s]
Resampling (pchip): 100%|██████████| 4064/4064 [00:00<00:00, 21714.40it/s]
Resampling (pchip): 100%|██████████| 380/380 [00:00<00:00, 9280.30it/s]
Resampling (pchip): 100%|██████████| 3000/3000 [00:00<00:00, 15804.52it/s]
Resampling (pchip): 100%|██████████| 3000/3000 [00:00<00:00, 39192.63it/s]



Saving combined 2nd derivative data...
✓ Saved: combined_deriv2_data.csv.xz (shape: (12214, 1166))
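
The `Resampling (pchip)` progress bars above correspond to interpolating every spectrum onto the shared 680-3000 cm⁻¹ grid at 2.0 cm⁻¹ spacing. The sketch below illustrates that idea for a single synthetic spectrum using SciPy's `PchipInterpolator`. It is not the `combine_datasets` implementation: the helper name `resample_to_grid`, the synthetic spectrum, and the choice `extrapolate=False` are assumptions made for illustration.

```python
import numpy as np
from scipy.interpolate import PchipInterpolator


def resample_to_grid(wavenumbers, intensities, wn_min=680.0, wn_max=3000.0,
                     resolution=2.0, descending=True):
    """Resample one spectrum onto a uniform wavenumber grid with PCHIP.

    Hypothetical helper for illustration; not part of xpectrass.
    """
    wavenumbers = np.asarray(wavenumbers, dtype=float)
    intensities = np.asarray(intensities, dtype=float)

    # PchipInterpolator requires strictly increasing x values.
    order = np.argsort(wavenumbers)
    interpolator = PchipInterpolator(
        wavenumbers[order], intensities[order], extrapolate=False
    )

    # Uniform target grid: (3000 - 680) / 2 + 1 = 1161 points.
    n_points = int(round((wn_max - wn_min) / resolution)) + 1
    grid = np.linspace(wn_min, wn_max, n_points)
    resampled = interpolator(grid)  # NaN outside the source range

    if descending:
        grid, resampled = grid[::-1], resampled[::-1]
    return grid, resampled


# Synthetic spectrum on a non-uniform-looking native grid (descending order),
# with a single Gaussian band centred at 1720 cm^-1.
source_wn = np.arange(4000.0, 598.0, -1.93)
source_abs = np.exp(-0.5 * ((source_wn - 1720.0) / 15.0) ** 2)

grid, resampled = resample_to_grid(source_wn, source_abs)
print(grid.shape, grid[0], grid[-1])
```

The 1161 grid points plus the metadata columns account for the 1166 columns reported in the saved shapes.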


---

## Step 6: Compute Derivatives of Combined Normalized Data

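Finally, we'll compute the 1st and 2nd derivatives of the combined normalized dataset. This is an alternative to Step 5, where derivatives were computed per dataset and then combined: here the derivative is taken after resampling, so every spectrum is differentiated on the same 2.0 cm⁻¹ grid.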

In [19]:
print("\n" + "="*80)
print("COMPUTING DERIVATIVES OF COMBINED NORMALIZED DATA")
print("="*80)

# Initialize FTIRdataprocessing class for combined data
fdp = FTIRdataprocessing(
    df=combined_norm_data,
    label_column=LABEL_COLUMN,
    exclude_regions=EXCLUDE_REGIONS,
    interpolate_regions=INTERPOLATE_REGIONS,
    flat_windows=FLAT_WINDOWS
)

# Compute 1st derivative of combined normalized data
print("\n→ Computing 1st derivative of combined data...")
combined_norm_deriv1_data = fdp.derivatives(
    data=combined_norm_data,
    order=1,
    window_length=15,
    polyorder=3,
    delta=1.0,
    plot=False,
    save_plot=False,
    save_path=None,
)

# Compute 2nd derivative of combined normalized data
print("→ Computing 2nd derivative of combined data...")
combined_norm_deriv2_data = fdp.derivatives(
    data=combined_norm_data,
    order=2,
    window_length=15,
    polyorder=3,
    delta=1.0,
    plot=False,
    save_plot=False,
    save_path=None,
)

# Save both derivative datasets
print("\n→ Saving combined normalized derivative data...")
combined_norm_deriv1_data.to_csv('processed_data/combined_norm_deriv1_data.csv.xz', compression='xz', index=None)
combined_norm_deriv2_data.to_csv('processed_data/combined_norm_deriv2_data.csv.xz', compression='xz', index=None)

print(f"✓ Saved: combined_norm_deriv1_data.csv.xz (shape: {combined_norm_deriv1_data.shape})")
print(f"✓ Saved: combined_norm_deriv2_data.csv.xz (shape: {combined_norm_deriv2_data.shape})")
print("="*80)


COMPUTING DERIVATIVES OF COMBINED NORMALIZED DATA

→ Computing 1st derivative of combined data...
Computing 1st derivative for 12214 samples...
→ Computing 2nd derivative of combined data...
Computing 2nd derivative for 12214 samples...

→ Saving combined normalized derivative data...
✓ Saved: combined_norm_deriv1_data.csv.xz (shape: (12214, 1166))
✓ Saved: combined_norm_deriv2_data.csv.xz (shape: (12214, 1166))
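
The parameter names `window_length`, `polyorder`, and `delta` used in the cell above match SciPy's `savgol_filter`, so the sketch below assumes (the source does not confirm this) that `fdp.derivatives` performs Savitzky-Golay differentiation. It shows what those parameters do on a synthetic band, and how `delta` only rescales the result: `delta=1.0` gives a derivative per grid step, while `delta=2.0` would give a derivative per cm⁻¹ on the 2.0 cm⁻¹ grid.

```python
import numpy as np
from scipy.signal import savgol_filter

# Common grid used for the combined data: 680-3000 cm^-1 at 2.0 cm^-1 spacing.
grid = np.linspace(680.0, 3000.0, 1161)

# Synthetic spectrum: one Gaussian band centred at 1720 cm^-1 (illustrative only).
spectrum = np.exp(-0.5 * ((grid - 1720.0) / 15.0) ** 2)

# Same settings as the notebook cell: 15-point window, cubic polynomial.
d1_per_step = savgol_filter(spectrum, window_length=15, polyorder=3, deriv=1, delta=1.0)
d2_per_step = savgol_filter(spectrum, window_length=15, polyorder=3, deriv=2, delta=1.0)

# Using the true grid spacing expresses the derivative per cm^-1 instead.
d1_per_cm = savgol_filter(spectrum, window_length=15, polyorder=3, deriv=1, delta=2.0)
d2_per_cm = savgol_filter(spectrum, window_length=15, polyorder=3, deriv=2, delta=2.0)

# The two conventions differ by a constant factor (spacing ** order).
print(np.allclose(d1_per_step, 2.0 * d1_per_cm))
print(np.allclose(d2_per_step, 4.0 * d2_per_cm))

# At the band maximum the 1st derivative crosses zero and the 2nd is negative.
peak = int(np.argmax(spectrum))
print(grid[peak], d1_per_cm[peak], d2_per_cm[peak])
```

Because the factor is constant across a uniformly sampled dataset, the choice of `delta` does not change peak positions or relative band shapes, only the absolute scale of the derivative values.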


---

## Summary and Next Steps

### Files Created

This notebook has created the following preprocessed data files:

**Combined Datasets:**
1. `combined_norm_data.csv.xz` - All datasets combined, normalized
2. `combined_deriv1_data.csv.xz` - 1st derivatives computed per dataset, then combined
3. `combined_deriv2_data.csv.xz` - 2nd derivatives computed per dataset, then combined
4. `combined_norm_deriv1_data.csv.xz` - 1st derivative computed on the combined normalized data
5. `combined_norm_deriv2_data.csv.xz` - 2nd derivative computed on the combined normalized data

All files are saved in the `processed_data/` directory in xz-compressed CSV format (`.csv.xz`), which keeps storage small and can be read directly by pandas.

### File Sizes

The compressed files are typically 10-20x smaller than uncompressed CSV files, making them ideal for:
- Version control (if needed)
- File transfer
- Long-term storage
- Fast loading with pandas/polars
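
The actual compression ratio depends on the data, so it is worth measuring on your own files. The sketch below writes the same DataFrame as plain CSV and as `.csv.xz` into a temporary directory and compares the sizes. The demo table is random noise with hypothetical column names, not one of the bundled datasets, so the ratio it prints says nothing about the ratio for real spectra.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Demo table: 50 rows x 1161 wavenumber columns (3000 down to 680 cm^-1).
rng = np.random.default_rng(0)
grid = np.arange(3000, 678, -2)
demo = pd.DataFrame(rng.normal(size=(50, grid.size)), columns=grid.astype(str))

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "demo.csv")
    xz_path = os.path.join(tmp, "demo.csv.xz")

    # Same data, written uncompressed and xz-compressed.
    demo.to_csv(csv_path, index=False)
    demo.to_csv(xz_path, compression="xz", index=False)

    csv_size = os.path.getsize(csv_path)
    xz_size = os.path.getsize(xz_path)

    # pandas infers the compression from the .xz extension when reading.
    restored = pd.read_csv(xz_path)

print(f"CSV: {csv_size} bytes, XZ: {xz_size} bytes, ratio: {csv_size / xz_size:.1f}x")
```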

### Loading the Data

To load these files in a future session:

```python
import pandas as pd

# Load combined normalized data (files were saved under processed_data/)
df = pd.read_csv('processed_data/combined_norm_data.csv.xz', compression='xz')

# Load combined 1st derivative data
df_deriv1 = pd.read_csv('processed_data/combined_deriv1_data.csv.xz', compression='xz')
```
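
Once loaded, the table mixes spectral intensities with metadata columns such as the label and the study name. A minimal sketch for separating the two is shown below. It assumes that spectral columns are named by their wavenumber (so the name parses as a number) and that metadata columns are not; the helper `split_spectral_columns`, the tiny demo table, and the `PE`/`PP` labels are hypothetical.

```python
import pandas as pd


def split_spectral_columns(df):
    """Split a combined table into (spectra, metadata).

    Hypothetical helper: a column counts as spectral if its name parses
    as a number (a wavenumber); everything else is treated as metadata.
    """
    spectral, metadata = [], []
    for name in df.columns:
        try:
            float(name)
            spectral.append(name)
        except ValueError:
            metadata.append(name)
    return df[spectral], df[metadata]


# Tiny stand-in for a loaded combined file (values are made up).
demo = pd.DataFrame({
    "3000.0": [0.10, 0.20, 0.15],
    "2998.0": [0.11, 0.19, 0.16],
    "2996.0": [0.12, 0.18, 0.17],
    "type": ["PE", "PP", "PE"],
    "study": ["jung_2018", "frond_2021", "jung_2018"],
})

X, meta = split_spectral_columns(demo)
print(X.shape)
print(meta["study"].value_counts())
```

Applied to the real files, the same split should leave 1161 spectral columns if the naming assumption holds; check `X.shape` after loading to confirm.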

### What's Next?

Your preprocessed data is now ready for:

1. **Exploratory Data Analysis** (Notebook 6)
   - Mean spectra visualization
   - PCA, t-SNE, UMAP dimensionality reduction
   - Statistical analysis (ANOVA, correlation)
   - Clustering analysis

2. **Machine Learning Classification** (Notebook 6)
   - Train multiple classification models
   - Hyperparameter tuning
   - Model comparison
   - SHAP explainability analysis

3. **Custom Analysis**
   - Export to other software
   - Build custom models
   - Publication-ready figures

### Tips for Using the Combined Data

- **combined_norm_data.csv.xz**: Use this for most analyses, as it's the fully preprocessed, normalized data
- **combined_deriv1_data.csv.xz**: Use when you need enhanced spectral resolution (better peak separation)
- **combined_deriv2_data.csv.xz**: Use for identifying overlapping peaks and subtle features
- **Study column**: Every combined file includes a `study` column that records which dataset each row (spectrum) came from

### Data Characteristics

- **Wavenumber range**: 680-3000 cm⁻¹ (consistent across all studies)
- **Resolution**: 2.0 cm⁻¹ (all spectra interpolated to this resolution)
- **Number of features**: 1161 wavenumber points
- **Preprocessing applied**: Conversion to absorbance → Denoising → Baseline correction → Atmospheric correction → Normalization

---

## Conclusion

You've successfully preprocessed and combined all 6 bundled FTIR datasets! The data now shares a common wavenumber grid and preprocessing pipeline and is ready for analysis. Proceed to Notebook 6 for exploratory data analysis and machine learning.