# Notebook 4: Get Preprocessed Data (Single Dataset)

## Overview

This notebook applies the **complete preprocessing pipeline** to a single FTIR dataset using the methods you selected in Notebooks 1-3. The result is a fully preprocessed set of spectra, saved to disk and ready for downstream analysis.

### What You'll Learn

1. How to apply all preprocessing steps in the correct order
2. How to use the convenient helper method for full preprocessing
3. How to save your preprocessed data for analysis
4. How to verify preprocessing results

### Prerequisites

Before running this notebook, you should have completed Notebooks 1-3 to determine:
- ✓ **Best denoising method** (Notebook 1)
- ✓ **Best baseline correction method** (Notebook 2)
- ✓ **Best normalization method** (Notebook 3)

### Preprocessing Pipeline

The complete pipeline follows this order (order matters!):

1. **Convert** to absorbance
2. **Denoise** to remove random noise
3. **Correct baseline** to remove drift
4. **Handle atmospheric interference** (CO₂, H₂O)
5. **Normalize** to make spectra comparable
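
Step 1 rests on the standard relation between transmittance and absorbance, A = -log10(T), which becomes A = 2 - log10(%T) for percent transmittance. The sketch below illustrates that relation with NumPy only; the function name and the `floor` guard are illustrative choices, not part of xpectrass, and the library's own conversion may treat edge cases differently.

```python
import numpy as np

def percent_transmittance_to_absorbance(percent_t, floor=1e-6):
    """Convert percent transmittance (0-100) to absorbance: A = 2 - log10(%T)."""
    percent_t = np.asarray(percent_t, dtype=float)
    # Clip to a small positive floor so log10 never sees zero or negative values.
    clipped = np.clip(percent_t, floor, None)
    return 2.0 - np.log10(clipped)

# 100 %T means no absorption; each tenfold drop in %T adds one absorbance unit.
absorbance = percent_transmittance_to_absorbance([100.0, 10.0, 1.0])
print(absorbance)  # [0. 1. 2.]
```

Because the conversion is logarithmic, running it before or after denoising and baseline correction gives different results, which is one reason the order of the steps matters.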

### Expected Output

- **Fully preprocessed dataset** ready for analysis
- **Excel file** with processed spectra
- **Visual comparison** of original vs. preprocessed data
- **Input for machine learning** classification (see Notebook 6)

---

## Step 1: Define Preprocessing Parameters

Based on your results from Notebooks 1-3, update the parameters below with your selected methods.

```python
# Import required modules
from xpectrass import FTIRdataprocessing
from xpectrass import load_villegas_camacho_2024_c4

# Load your dataset
# You can use any of the bundled datasets or your own data
dataset = load_villegas_camacho_2024_c4()
print('Dataset shape:', dataset.shape)

# ============================================================================
# CONFIGURATION: Update these parameters based on your Notebooks 1-3 results
# ============================================================================

# Label column (contains polymer type information)
LABEL_COLUMN = "type"

# Flat windows for baseline correction evaluation
FLAT_WINDOWS = [(1880, 1900), (2400, 2700)]

# SELECTED METHOD FROM NOTEBOOK 1
DENOISING_METHOD = 'wavelet'  # Options: 'savgol', 'wavelet', 'gaussian', 'median', etc.

# SELECTED METHOD FROM NOTEBOOK 2
BASELINE_CORRECTION_METHOD = 'aspls'  # Options: 'asls', 'airpls', 'mor', 'snip', etc.

# Atmospheric correction settings
# Define regions to exclude (completely remove from analysis)
EXCLUDE_REGIONS = [
    (0, 680),       # Low wavenumber noise + CO₂ bending (670 cm⁻¹)
    (3500, 5000)    # High wavenumber noise + O–H stretch
]

# Define regions to interpolate (replace with baseline)
INTERPOLATE_REGIONS = [
    (1250, 2700)    # H₂O bend + CO₂ stretch regions
]

# Interpolation method for atmospheric regions
INTERPOLATE_METHOD = "zero"  # Options: 'zero', 'linear', 'spline'

# SELECTED METHOD FROM NOTEBOOK 3
NORMALIZATION_METHOD = "spectral_moments"  # Options: 'snv', 'vector', 'minmax', 'area', etc.

# ============================================================================

print("\n" + "="*80)
print("PREPROCESSING CONFIGURATION")
print("="*80)
print(f"Dataset: Villegas-Camacho 2024 C4 ({len(dataset)} samples)")
print("\nSelected Methods:")
print(f"  1. Denoising:          {DENOISING_METHOD}")
print(f"  2. Baseline:           {BASELINE_CORRECTION_METHOD}")
print(f"  3. Atmospheric:        Interpolate method = {INTERPOLATE_METHOD}")
print(f"  4. Normalization:      {NORMALIZATION_METHOD}")
print("\nRegion Settings:")
print(f"  Exclude:     {EXCLUDE_REGIONS}")
print(f"  Interpolate: {INTERPOLATE_REGIONS}")
print(f"  Flat windows: {FLAT_WINDOWS}")
print("="*80 + "\n")

print("First few rows of raw data:")
print(dataset.head(5))
```
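
Region settings are easy to mistype, so it can help to check them before running the pipeline. The following is a minimal sketch in plain Python; `validate_regions` and `windows_inside` are hypothetical helpers written for this notebook, not xpectrass functions, and the library may apply its own validation.

```python
def validate_regions(regions, name="regions"):
    """Return the regions sorted by lower bound, raising ValueError on bad input."""
    cleaned = []
    for low, high in regions:
        if low >= high:
            raise ValueError(f"{name}: lower bound {low} must be below upper bound {high}")
        cleaned.append((float(low), float(high)))
    cleaned.sort()
    # After sorting, any overlap shows up between neighbouring regions.
    for (_, prev_high), (next_low, _) in zip(cleaned, cleaned[1:]):
        if next_low < prev_high:
            raise ValueError(f"{name}: regions overlap near {next_low}")
    return cleaned

def windows_inside(windows, regions):
    """List the windows that lie entirely within one of the regions."""
    return [
        (w_low, w_high)
        for w_low, w_high in windows
        if any(r_low <= w_low and w_high <= r_high for r_low, r_high in regions)
    ]

# Values copied from the configuration in this notebook.
exclude = validate_regions([(0, 680), (3500, 5000)], name="EXCLUDE_REGIONS")
interpolate = validate_regions([(1250, 2700)], name="INTERPOLATE_REGIONS")
flat = validate_regions([(1880, 1900), (2400, 2700)], name="FLAT_WINDOWS")
print(windows_inside(flat, interpolate))
```

With the values used here, both flat windows fall inside the interpolation region. Whether that matters depends on when the library evaluates the flat windows relative to atmospheric correction, which this sketch does not assume.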

### Important Notes

1. **Update the methods above** with your results from Notebooks 1-3
2. **Verify your settings** (label column, regions, interpolation method) against your experimental requirements
3. **Region settings** are given in cm⁻¹ and should fall within the wavenumber range covered by your spectra

---

## Step 2: Apply Complete Preprocessing Pipeline

Now we'll apply all preprocessing steps in one convenient method call.

```python
# Initialize the preprocessing pipeline
print("Initializing FTIRdataprocessing pipeline...")
fdp = FTIRdataprocessing(
    df=dataset,
    label_column=LABEL_COLUMN,
    exclude_regions=EXCLUDE_REGIONS,
    interpolate_regions=INTERPOLATE_REGIONS,
    flat_windows=FLAT_WINDOWS
)

print("\n" + "="*80)
print("APPLYING FULL PREPROCESSING PIPELINE")
print("="*80)
print("\nProcessing steps:")
print("  Step 1/5: Converting to absorbance...")
print("  Step 2/5: Denoising spectra...")
print("  Step 3/5: Correcting baseline...")
print("  Step 4/5: Handling atmospheric interference...")
print("  Step 5/5: Normalizing spectra...")
print("\nThis may take a few minutes for large datasets...\n")

# Apply full preprocessing pipeline:
# 1. Convert to absorbance
# 2. Denoise
# 3. Correct baseline
# 4. Atmospheric correction (exclude + interpolate)
# 5. Normalize
#
# _get_normalized_data() is a convenience method that applies all steps
df_preprocessed = fdp._get_normalized_data(
    denoising_method=DENOISING_METHOD,
    baseline_correction_method=BASELINE_CORRECTION_METHOD,
    interpolate_method=INTERPOLATE_METHOD,
    normalization_method=NORMALIZATION_METHOD,
    plot=True,  # Show before/after comparison
)

print("\n" + "="*80)
print("PREPROCESSING COMPLETE!")
print("="*80)
print(f"\nOriginal data shape:     {dataset.shape}")
print(f"Preprocessed data shape: {df_preprocessed.shape}")
print("\nFeatures:")
print(f"  Original wavenumbers:    {dataset.shape[1] - 1}")
print(f"  After region exclusion:  {df_preprocessed.shape[1] - 1}")
print(f"  Reduction:               {dataset.shape[1] - df_preprocessed.shape[1]} features removed")

# Save the fully preprocessed data
# (writing .xlsx with pandas requires an Excel engine such as openpyxl)
output_file = 'DenoisedBaselineAtmosphericCorrectedNormalizedData.xlsx'
print(f"\nSaving preprocessed data to: {output_file}")
df_preprocessed.to_excel(output_file, index=False)
print("✓ Data saved successfully!")

print("\n" + "="*80)
print("PREPROCESSING SUMMARY")
print("="*80)
print(f"✓ {len(df_preprocessed)} spectra processed")
print(f"✓ {df_preprocessed[LABEL_COLUMN].nunique()} polymer types")
print(f"✓ {df_preprocessed.shape[1] - 1} wavenumber features")
print("✓ Data is ready for analysis and machine learning")
print("="*80)
```
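
The shape printout gives a first impression of the result, but a few explicit checks go further toward verifying it. The sketch below assumes the data are pandas DataFrames with one label column and otherwise numeric spectral columns; `check_preprocessed` is a hypothetical helper, not part of xpectrass, and the demo frames are synthetic.

```python
import numpy as np
import pandas as pd

def check_preprocessed(raw, processed, label_column):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if len(raw) != len(processed):
        problems.append(f"row count changed: {len(raw)} -> {len(processed)}")
    if label_column not in processed.columns:
        problems.append(f"label column '{label_column}' is missing")
    elif len(raw) == len(processed) and not (
        raw[label_column].to_numpy() == processed[label_column].to_numpy()
    ).all():
        problems.append("labels differ from the raw data")
    # Assumption: every column other than the label holds numeric intensities.
    spectra = processed.drop(columns=[label_column], errors="ignore")
    if not np.isfinite(spectra.to_numpy(dtype=float)).all():
        problems.append("spectra contain NaN or infinite values")
    return problems

# Synthetic stand-ins for the raw and preprocessed frames.
raw_demo = pd.DataFrame({"type": ["PE", "PP"], "700.0": [0.1, 0.2], "702.0": [0.3, 0.4]})
good_demo = raw_demo.copy()
bad_demo = raw_demo.copy()
bad_demo.loc[0, "700.0"] = np.nan

print(check_preprocessed(raw_demo, good_demo, "type"))  # []
print(check_preprocessed(raw_demo, bad_demo, "type"))   # ['spectra contain NaN or infinite values']
```

In this notebook the call would be `check_preprocessed(dataset, df_preprocessed, LABEL_COLUMN)`, provided the pipeline keeps the row order of the input.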

### What Just Happened?

The `_get_normalized_data()` method applied all preprocessing steps sequentially:

1. **Conversion**: Transmittance → Absorbance (if needed)
2. **Denoising**: Applied your selected denoising method
3. **Baseline Correction**: Applied your selected baseline method
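4. **Atmospheric Correction**:
   - Removed the excluded regions (0-680 and 3500-5000 cm⁻¹), keeping 680-3500 cm⁻¹
   - Filled the H₂O and CO₂ regions (1250-2700 cm⁻¹) using the selected interpolation method (`zero` in this configuration)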
5. **Normalization**: Applied your selected normalization method
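
To make the region exclusion in step 4 concrete, here is a minimal sketch of dropping spectral columns by wavenumber with pandas. It assumes the column names can be parsed as wavenumbers in cm⁻¹ and treats region bounds as inclusive; `drop_excluded_columns` is a hypothetical helper, and the library's own implementation may handle boundaries differently.

```python
import numpy as np
import pandas as pd

def drop_excluded_columns(df, label_column, exclude_regions):
    """Drop spectral columns whose wavenumber falls inside any excluded region."""
    spectral_cols = [c for c in df.columns if c != label_column]
    wavenumbers = np.array([float(c) for c in spectral_cols])
    excluded = np.zeros(len(spectral_cols), dtype=bool)
    for low, high in exclude_regions:
        # Bounds are treated as inclusive here; the library may differ.
        excluded |= (wavenumbers >= low) & (wavenumbers <= high)
    kept = [c for c, drop in zip(spectral_cols, excluded) if not drop]
    return df[[label_column] + kept]

# One synthetic spectrum with four wavenumber columns.
demo = pd.DataFrame(
    [["PE", 0.1, 0.2, 0.3, 0.4]],
    columns=["type", "650.0", "1000.0", "3000.0", "3600.0"],
)
trimmed = drop_excluded_columns(demo, "type", [(0, 680), (3500, 5000)])
print(list(trimmed.columns))  # ['type', '1000.0', '3000.0']
```

This is why the preprocessed data has fewer wavenumber features than the raw data: the columns inside the excluded regions are removed outright rather than modified.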

### Output Files

- **`DenoisedBaselineAtmosphericCorrectedNormalizedData.xlsx`**: 
  - Fully preprocessed spectra
  - Ready for analysis and machine learning
  - Can be loaded directly into the `FTIRdataanalysis` class

### Alternative: Step-by-Step Approach

If you prefer more control, you can apply each step manually:

```python
# Manual approach (equivalent to _get_normalized_data())
# Uses the dataset and configuration constants defined in Step 1
fdp = FTIRdataprocessing(df=dataset, label_column=LABEL_COLUMN,
                         exclude_regions=EXCLUDE_REGIONS,
                         interpolate_regions=INTERPOLATE_REGIONS,
                         flat_windows=FLAT_WINDOWS)

# Step 1: Convert
fdp.convert(mode="to_absorbance", plot=True)

# Step 2: Denoise
fdp.denoise_spect(method=DENOISING_METHOD, plot=True)

# Step 3: Baseline correction
fdp.correct_baseline(method=BASELINE_CORRECTION_METHOD, plot=True)

# Step 4: Atmospheric correction
fdp.exclude_interpolate(method=INTERPOLATE_METHOD, plot=True)

# Step 5: Normalize
fdp.normalize(method=NORMALIZATION_METHOD, plot=True)

# Get final data
df_preprocessed = fdp.df_norm
```

Both approaches should produce the same preprocessed data. The helper method is more convenient, while the manual approach lets you plot and inspect the result of each step.
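
One way to check that the two approaches agree on your own data is to compare their outputs numerically. The sketch below assumes both results are pandas DataFrames with the same label column; `frames_match` is a hypothetical helper, and the tolerances are illustrative defaults that allow for floating-point noise.

```python
import numpy as np
import pandas as pd

def frames_match(a, b, label_column, rtol=1e-9, atol=1e-12):
    """True if both frames have the same columns, labels, and near-equal spectra."""
    if a.shape != b.shape or list(a.columns) != list(b.columns):
        return False
    if not (a[label_column].to_numpy() == b[label_column].to_numpy()).all():
        return False
    spectra_a = a.drop(columns=[label_column]).to_numpy(dtype=float)
    spectra_b = b.drop(columns=[label_column]).to_numpy(dtype=float)
    return bool(np.allclose(spectra_a, spectra_b, rtol=rtol, atol=atol))

# Synthetic stand-ins for the helper output and the manual output.
helper_out = pd.DataFrame({"type": ["PE", "PP"], "1000.0": [0.5, 0.25]})
manual_out = helper_out.copy()
manual_out["1000.0"] = manual_out["1000.0"] + 1e-15  # floating-point noise

print(frames_match(helper_out, manual_out, "type"))  # True
```

If the comparison returns `False` for real data, inspecting the intermediate plots from the manual approach is a reasonable next step for locating the step where the two results diverge.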

---

## Next Steps

Your preprocessed data is now ready for:

1. **Machine Learning Classification** (see Notebook 6)
   - Train classification models
   - Evaluate performance
   - Hyperparameter tuning
   - SHAP explainability

2. **Statistical Analysis** (see Notebook 6)
   - PCA, t-SNE, UMAP visualization
   - ANOVA analysis
   - Clustering analysis
   - Correlation studies

3. **External Analysis**
   - Export to other software (The Unscrambler, SIMCA, etc.)
   - Custom analysis pipelines
   - Publication-ready figures

### Loading Preprocessed Data Later

To load your preprocessed data in a future session:

```python
import pandas as pd
from xpectrass import FTIRdataanalysis

# Load preprocessed data
# The file was saved with index=False, so there is no index column to read back
df = pd.read_excel('DenoisedBaselineAtmosphericCorrectedNormalizedData.xlsx')

# Initialize analysis
analysis = FTIRdataanalysis(df, label_column="type")

# Proceed with analysis...
analysis.plot_pca()
analysis.run_all_models()
```
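
If you plan to use the data outside `FTIRdataanalysis`, for example in your own machine learning code, the loaded table usually has to be split into a feature matrix and a label vector first. A minimal sketch, assuming one label column and otherwise numeric spectral columns; `split_features_labels` is a hypothetical helper and the demo frame is synthetic.

```python
import pandas as pd

def split_features_labels(df, label_column):
    """Return (X, y): spectra as a 2-D float array and labels as a 1-D array."""
    y = df[label_column].to_numpy()
    X = df.drop(columns=[label_column]).to_numpy(dtype=float)
    return X, y

# Synthetic stand-in for the loaded preprocessed data.
demo = pd.DataFrame({
    "type": ["PE", "PP", "PE"],
    "1000.0": [0.1, 0.2, 0.3],
    "1002.0": [0.4, 0.5, 0.6],
})
X, y = split_features_labels(demo, "type")
print(X.shape, list(y))  # (3, 2) ['PE', 'PP', 'PE']
```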

---

## Conclusion

You've applied the complete preprocessing pipeline to your FTIR dataset and saved the result. The data is now ready for analysis.

- For processing **all bundled datasets** at once, see Notebook 5
- For **analysis and machine learning**, proceed to Notebook 6