# Integrating a Raw Data Preprocessing Pipeline into ML and DL Pipelines

This tutorial shows how to use a raw data preprocessing pipeline (for example, for ECG signals) and integrate it into both machine learning and deep learning workflows. We will:

- Preprocess raw biosignals (using filtering and denoising functions)
- Extract simple features from the preprocessed signals
- Build a traditional machine learning pipeline (using scikit-learn) to classify the signals
- Build a deep learning pipeline (using PyTorch) with a custom model and training loop

Follow along to see how the same preprocessed signals can feed both model training and inference in either framework.

## 1. Environment Setup and Imports

We start by importing the required packages: NumPy, SciPy, and Matplotlib for numerics and plotting, NeuroKit2 for signal simulation, PyTorch for deep learning, and scikit-learn for the machine learning classifier.

We also import the repository’s advanced ECG denoising function and configuration (note: parts of the raw signal pipeline are provided by the ECG_ACORAI repository).
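If `ecg_processor_torch` is not importable in your environment, one option is to guard the import and fall back to a stand-in. The sketch below assumes a simple moving-average smoother as the fallback. It is a hypothetical placeholder, not the repository's wavelet method, and the `USING_FALLBACK` flag and `kernel_size` parameter are names invented here for illustration.

```python
import torch
import torch.nn.functional as F

try:
    # Preferred path: the repository's denoiser, as imported in this tutorial
    from ecg_processor_torch.advanced_denoising import wavelet_denoise
    USING_FALLBACK = False
except ImportError:
    USING_FALLBACK = True

    def wavelet_denoise(signal: torch.Tensor, kernel_size: int = 5) -> torch.Tensor:
        """Hypothetical stand-in: a moving average, NOT wavelet denoising."""
        if kernel_size % 2 == 0:
            raise ValueError("kernel_size must be odd so the output keeps its length")
        half = kernel_size // 2
        x = signal.reshape(1, 1, -1)                  # (batch, channel, time)
        x = F.pad(x, (half, half), mode="replicate")  # repeat edge samples
        kernel = torch.full((1, 1, kernel_size), 1.0 / kernel_size, dtype=signal.dtype)
        return F.conv1d(x, kernel).reshape(-1)        # back to a 1-D tensor

# Either implementation is called the same way as in the rest of the tutorial
smoothed = wavelet_denoise(torch.randn(1000))
print("Using fallback denoiser:", USING_FALLBACK, "| output length:", smoothed.shape[0])
```

Keeping the function name unchanged lets the remaining cells run as written, but results obtained with the fallback should not be read as wavelet-denoised.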

In [None]:
import sys
import os

# Get the repository root folder (assuming the notebook is in /your-project-root/notebooks)
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

print("Project root added to sys.path:", project_root)

import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import butter, filtfilt
import neurokit2 as nk
import torch

# scikit-learn imports for ML pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Import advanced ECG denoising from the repository (if available in your environment)
from ecg_processor_torch.advanced_denoising import wavelet_denoise
from ecg_processor_torch.config import ECGConfig

# For inline plotting
%matplotlib inline

# Set random seeds for reproducibility
np.random.seed(42)
torch.manual_seed(42)

print("Setup complete. ECG sampling rate:", ECGConfig.DEFAULT_SAMPLING_RATE)

## 2. Simulating and Preprocessing Raw Biosignals

In a typical scenario, signals are acquired from sensors. For this tutorial, we'll simulate a noisy ECG signal using NeuroKit2 and then preprocess it with the repository's advanced wavelet denoising function.

Below, we also define Butterworth low-pass helper functions. The cell does not apply them to the ECG, but they can be reused for other signal types (e.g., EDA, EMG, respiration).
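As a standalone sketch of how such a low-pass helper might be used on a slower signal, the example below filters a synthetic respiration-like wave. The sampling rate, cutoff, filter order, and noise levels are all assumptions chosen for illustration, and `lowpass` is a compact local function rather than the tutorial's helper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def lowpass(data, cutoff, fs, order=4):
    """Zero-phase Butterworth low-pass filter (forward and backward pass)."""
    b, a = butter(order, cutoff / (0.5 * fs), btype="low")
    return filtfilt(b, a, data)

fs_resp = 50                                   # assumed sampling rate (Hz)
t_resp = np.arange(0, 20, 1 / fs_resp)         # 20 seconds of samples
clean = np.sin(2 * np.pi * 0.25 * t_resp)      # 0.25 Hz breathing-like wave

rng = np.random.default_rng(0)
noisy = (clean
         + 0.3 * np.sin(2 * np.pi * 10 * t_resp)      # 10 Hz interference
         + 0.1 * rng.standard_normal(t_resp.size))    # broadband noise

filtered = lowpass(noisy, cutoff=2.0, fs=fs_resp)

# Root-mean-square error against the known clean wave, before and after
err_before = np.sqrt(np.mean((noisy - clean) ** 2))
err_after = np.sqrt(np.mean((filtered - clean) ** 2))
print(f"RMSE before: {err_before:.3f}, after: {err_after:.3f}")
```

`filtfilt` runs the filter in both directions, so the output is not shifted in time relative to the input, which matters when later steps rely on peak locations.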

In [None]:
# Define a Butterworth low-pass filter and filter function
def butter_lowpass(cutoff, fs, order=5):
    nyquist = 0.5 * fs
    normal_cutoff = cutoff / nyquist
    b, a = butter(order, normal_cutoff, btype='low', analog=False)
    return b, a

def filter_signal(data, cutoff, fs, order=5):
    b, a = butter_lowpass(cutoff, fs, order=order)
    y = filtfilt(b, a, data)
    return y

# Simulate a noisy ECG signal (10 seconds at fs Hz)
fs = 500  # sampling frequency
ecg_noisy = nk.ecg_simulate(duration=10, sampling_rate=fs, noise=0.1)

# Plot the raw ECG signal
t = np.arange(len(ecg_noisy)) / fs  # time axis in seconds, one entry per sample
plt.figure(figsize=(12, 4))
plt.plot(t, ecg_noisy, label='Noisy ECG')
plt.title('Simulated Noisy ECG Signal')
plt.xlabel('Time (s)')
plt.ylabel('Amplitude')
plt.legend()
plt.show()

# Convert the ECG signal to a PyTorch tensor for the advanced denoising function
ecg_tensor = torch.tensor(ecg_noisy, dtype=torch.float32)

# Apply advanced wavelet denoising from the repository
ecg_denoised_tensor = wavelet_denoise(ecg_tensor)

# Convert the denoised signal back to NumPy for further processing and visualization
ecg_denoised = ecg_denoised_tensor.cpu().numpy()

plt.figure(figsize=(12, 4))
plt.plot(t, ecg_noisy, label='Noisy ECG', alpha=0.6)
plt.plot(t, ecg_denoised, label='Denoised ECG', linewidth=2)
plt.title('Advanced ECG Denoising')
plt.xlabel('Time (s)')
plt.legend()
plt.show()

## 3. Feature Extraction

After preprocessing the raw signals, we need to extract features that can be used for classification. In this example, we extract simple statistical features (mean, standard deviation, min, max) from each ECG signal. In real applications, you might add frequency-domain, morphological, or other domain-specific features.
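To make the frequency-domain option concrete, here is a minimal sketch that derives two spectral features with NumPy's FFT. The function name `spectral_features`, the 15 Hz band edge, and the choice of features are assumptions for illustration, not part of the repository.

```python
import numpy as np

def spectral_features(signal, fs, band_hz=15.0):
    """Return the dominant frequency and the share of power at or below band_hz."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                    # remove the DC offset
    power = np.abs(np.fft.rfft(signal)) ** 2           # one-sided power spectrum
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)   # frequency of each bin (Hz)
    total = power.sum()
    low_ratio = power[freqs <= band_hz].sum() / total if total > 0 else 0.0
    return {
        "dominant_freq": float(freqs[np.argmax(power)]),
        "low_band_ratio": float(low_ratio),
    }

# Usage on a synthetic 5 Hz sine sampled at 500 Hz for 10 seconds
t_demo = np.arange(5000) / 500.0
demo = spectral_features(np.sin(2 * np.pi * 5 * t_demo), fs=500)
print(demo)
```

These values could be appended to the statistical features so that each signal still maps to one fixed-length vector.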

The following cell demonstrates a simple feature extraction function.

In [None]:
def extract_features(signal):
    features = {}
    features['mean'] = np.mean(signal)
    features['std'] = np.std(signal)
    features['min'] = np.min(signal)
    features['max'] = np.max(signal)
    return features

# Extract features from the denoised ECG signal
features = extract_features(ecg_denoised)
print("Extracted features:", features)

## 4. Integrating with a Machine Learning Pipeline

In this section, we demonstrate how to use the extracted features in a traditional machine learning model. We simulate a simple classification task where each ECG signal is assigned a label (here, we use synthetic labels for demonstration). Then we split the data into training and testing sets, train a Random Forest classifier from scikit-learn, and evaluate its accuracy.
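A single train/test split on 100 samples gives a noisy accuracy estimate. One alternative, sketched below, wraps the classifier in a scikit-learn `Pipeline` and scores it with stratified cross-validation. The feature matrix here is random placeholder data, not ECG features, and the scaler is included only so the same pipeline can be reused with scale-sensitive estimators (a Random Forest does not need it).

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Placeholder data: 100 samples x 4 features, two partly overlapping classes
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 4)),
    rng.normal(1.5, 1.0, size=(50, 4)),
])
y_demo = np.array([0] * 50 + [1] * 50)

# Scaling and classification are fitted together inside each fold
pipe = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"Accuracy per fold: {np.round(scores, 3)}, mean: {scores.mean():.3f}")
```

Because the scaler is fitted inside each fold, no statistics from held-out samples leak into training.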

In practice, you would extract features from many signals to create a feature matrix and label vector.
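The step from many signals to a feature matrix can be sketched as follows. The helper name `build_feature_matrix` is hypothetical, and the random arrays stand in for preprocessed recordings. The point is that signals of different lengths all reduce to rows of the same width.

```python
import numpy as np

FEATURE_NAMES = ("mean", "std", "min", "max")

def build_feature_matrix(signals):
    """Stack one row of summary statistics per signal."""
    rows = []
    for sig in signals:
        sig = np.asarray(sig, dtype=float)
        rows.append([sig.mean(), sig.std(), sig.min(), sig.max()])
    return np.array(rows)

# Stand-in signals of different lengths (placeholders, not real ECG)
rng = np.random.default_rng(0)
demo_signals = [rng.standard_normal(n) for n in (4000, 5000, 6000)]

X_feat = build_feature_matrix(demo_signals)
print("Feature matrix shape:", X_feat.shape)  # one row per signal, one column per feature
```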

In [None]:
# For demonstration, create synthetic data

# Let's assume we have 100 samples (signals) and we extract the same 4 features from each
num_samples = 100
X = []
y = []
for i in range(num_samples):
    # Simulate a noisy ECG signal
    ecg_sample = nk.ecg_simulate(duration=10, sampling_rate=fs, noise=0.1)
    ecg_tensor_sample = torch.tensor(ecg_sample, dtype=torch.float32)
    ecg_denoised_sample = wavelet_denoise(ecg_tensor_sample).cpu().numpy()
    feats = extract_features(ecg_denoised_sample)
    # Create a feature vector
    feat_vector = [feats['mean'], feats['std'], feats['min'], feats['max']]
    X.append(feat_vector)
    
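    # Simulate binary labels (0 or 1) for demonstration. The label is the sign of the
    # signal mean, which is also a feature, so the reported accuracy is not meaningful.
    label = 0 if np.mean(ecg_denoised_sample) < 0 else 1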
    y.append(label)

X = np.array(X)
y = np.array(y)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate the model
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print("Random Forest Accuracy:", acc)

## 5. Integrating with a Deep Learning Pipeline (Using PyTorch)

For deep learning, we can build an end-to-end pipeline that accepts raw or preprocessed signals, extracts features, and feeds them to a neural network. In this example, we create a simple feed-forward network that uses the same feature vector as the ML pipeline. We define a PyTorch `Dataset` and `DataLoader`, build a simple model, run a training loop, and evaluate the performance.
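Neural networks usually train more smoothly when input features share a similar scale. A minimal sketch of standardizing with statistics from the training split only is shown below. The helper names and the placeholder tensors are assumptions for illustration.

```python
import torch

def fit_standardizer(train_x: torch.Tensor, eps: float = 1e-8):
    """Compute per-feature mean and standard deviation from training data."""
    mean = train_x.mean(dim=0, keepdim=True)
    std = train_x.std(dim=0, unbiased=False, keepdim=True)
    return mean, std.clamp_min(eps)   # guard against division by zero

def apply_standardizer(x: torch.Tensor, mean: torch.Tensor, std: torch.Tensor):
    """Scale any split with the statistics fitted on the training split."""
    return (x - mean) / std

torch.manual_seed(0)
scales = torch.tensor([0.01, 0.3, 1.0, 2.0])    # features on very different scales
offsets = torch.tensor([0.0, 0.3, -0.5, 1.2])
train_x = torch.randn(80, 4) * scales + offsets  # placeholder training features
test_x = torch.randn(20, 4) * scales + offsets   # placeholder test features

mean, std = fit_standardizer(train_x)
train_z = apply_standardizer(train_x, mean, std)
test_z = apply_standardizer(test_x, mean, std)   # reuses the training statistics
print("Train column means after scaling:", train_z.mean(dim=0))
```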

This should give you a template that you can extend to more sophisticated deep learning architectures (like CNNs, RNNs, or transformers).
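As one possible extension, the sketch below shows a small 1D CNN that takes the denoised waveform itself instead of hand-crafted features. The class name `SmallECGCNN`, the layer sizes, and the input length are assumptions for illustration, and the network has not been tuned for ECG.

```python
import torch
import torch.nn as nn

class SmallECGCNN(nn.Module):
    """Tiny 1D CNN operating on waveforms shaped (batch, 1, time)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.MaxPool1d(4),                  # downsample in time
            nn.Conv1d(8, 16, kernel_size=7, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),          # collapse time to a single step
        )
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x):
        x = self.features(x)                  # (batch, 16, 1)
        return self.classifier(x.squeeze(-1)) # (batch, num_classes)

cnn = SmallECGCNN()
batch = torch.randn(4, 1, 5000)               # e.g. four 10 s signals at 500 Hz
logits = cnn(batch)
print("Logits shape:", tuple(logits.shape))
```

The adaptive pooling layer makes the output size independent of signal length, so recordings of different durations can share one model. The training loop only needs the dataset to return waveforms with a channel dimension instead of feature vectors.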

In [None]:
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Define a PyTorch Dataset for our feature vectors
class ECGDataset(Dataset):
    def __init__(self, features, labels):
        self.X = torch.tensor(features, dtype=torch.float32)
        self.y = torch.tensor(labels, dtype=torch.long)
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

# Create dataset and dataloader
dataset = ECGDataset(X, y)
train_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_size
train_dataset, test_dataset = torch.utils.data.random_split(dataset, [train_size, test_size])
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)

# Define a simple feed-forward neural network
class SimpleNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, num_classes)
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Instantiate the model, loss function, and optimizer
input_dim = 4   # four features per sample
hidden_dim = 16
num_classes = 2
model = SimpleNN(input_dim, hidden_dim, num_classes)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Training loop
num_epochs = 20
for epoch in range(num_epochs):
    model.train()
    total_loss = 0.0
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")

# Evaluation on test set
model.eval()
all_preds = []
all_labels = []
with torch.no_grad():
    for inputs, labels in test_loader:
        outputs = model(inputs)
        preds = outputs.argmax(dim=1)  # index of the highest logit per sample
        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(labels.cpu().numpy())

# accuracy_score was already imported in the setup cell
accuracy = accuracy_score(all_labels, all_preds)
print("Deep Learning Model Accuracy:", accuracy)

## 6. Conclusion

In this tutorial we integrated a raw data preprocessing pipeline into both a traditional machine learning pipeline and a deep learning pipeline. We:

- Simulated and preprocessed raw ECG signals
- Extracted simple statistical features
- Built a Random Forest classifier using scikit-learn
- Built a simple PyTorch-based neural network for classification

This shows how signal preprocessing, feature extraction, and two different model training frameworks can share one workflow. In real projects, you would likely replace the simple statistical features with domain-specific routines, the synthetic labels with real annotations, and the simple models with more sophisticated architectures.

Happy coding and model building!