In [1]:
%load_ext autoreload
%autoreload 2

# Feature Extraction in ECG Data

Feature extraction is a critical step in analyzing ECG data, especially for tasks like Atrial Fibrillation (AF) classification. The raw ECG signal is rich in information, but that information must be condensed into a fixed set of numeric features before a machine learning model can use it.

## Key Features for AF Classification

### R-R Interval Features
R-R intervals, the periods between consecutive R-peaks in the ECG signal, are fundamental in assessing heart rhythm. Key features extracted from R-R intervals include:
- **RR_mean**: The average time between R-peaks in milliseconds, which is inversely related to the mean heart rate.
- **RR_std**: The standard deviation of the R-R intervals, indicating how much the beat-to-beat timing varies. High variability is a strong indicator of AF.
- **Irregularity_index**: A measure of rhythm irregularity, calculated as the proportion of successive R-R interval changes that exceed a threshold (here 50 ms). This index is particularly relevant for AF, where an irregular rhythm is a defining sign.
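
The three features above can be sketched on hand-made R-peak positions. This is a minimal illustration, not the notebook's pipeline: the helper name `rr_interval_features`, the 300 Hz sampling rate, and the peak positions are all assumptions chosen for the example.

```python
import numpy as np

def rr_interval_features(r_peaks, fs, threshold_ms=50.0):
    """Compute RR_mean, RR_std and Irregularity_index from R-peak sample indices."""
    rri = np.diff(r_peaks) / fs * 1000.0       # R-R intervals in ms
    successive_diff = np.abs(np.diff(rri))     # change between neighbouring intervals
    if successive_diff.size:
        irregularity = float(np.mean(successive_diff > threshold_ms))
    else:
        irregularity = float("nan")            # fewer than two intervals available
    return {
        "RR_mean": float(np.mean(rri)),
        "RR_std": float(np.std(rri)),
        "Irregularity_index": irregularity,
    }

fs = 300  # assumed sampling rate in Hz

# Hypothetical R-peak positions (sample indices)
regular_peaks = np.arange(0, 3000, 240)                          # one beat every 0.8 s
irregular_peaks = np.array([0, 180, 480, 630, 990, 1170, 1500])  # uneven spacing

regular = rr_interval_features(regular_peaks, fs)
irregular = rr_interval_features(irregular_peaks, fs)
print(regular)
print(irregular)
```

In this toy case the evenly spaced peaks give zero variability and an irregularity index of 0, while every successive interval change in the uneven sequence exceeds 50 ms, giving an index of 1.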

### Frequency Domain Features
These features are derived from the power spectral density of the ECG signal and include:
- **LF (Low Frequency Power)**: Power in the 0.04 to 0.15 Hz band. Represents a mix of sympathetic and parasympathetic nervous system activity.
- **HF (High Frequency Power)**: Power in the 0.15 to 0.4 Hz band. More closely related to parasympathetic activity.
- **LF/HF Ratio**: Provides insights into the autonomic balance or stress levels, which can be altered in AF.
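
As a rough sketch of how band power can be read off a Welch power spectral density, the example below uses a synthetic two-tone signal rather than real ECG. The `band_power` helper, the 4 Hz sampling rate, and the tone frequencies are assumptions made for illustration. Note that the Welch segment (`nperseg`) must be long enough for the frequency resolution `fs / nperseg` to place several bins inside each band.

```python
import numpy as np
from scipy import signal

def band_power(f, pxx, low, high):
    """Approximate power in [low, high) as the sum of PSD bins times the bin width."""
    mask = (f >= low) & (f < high)
    return float(np.sum(pxx[mask]) * (f[1] - f[0]))

fs = 4.0                        # assumed sampling rate (Hz) of the synthetic signal
t = np.arange(0, 300, 1 / fs)   # 300 s of samples

# One tone inside the LF band (0.10 Hz) and a weaker one inside the HF band (0.25 Hz)
x = 1.0 * np.sin(2 * np.pi * 0.10 * t) + 0.5 * np.sin(2 * np.pi * 0.25 * t)

# nperseg=512 gives a resolution of 4 / 512 Hz, fine enough for both bands
f, pxx = signal.welch(x, fs=fs, nperseg=512)

lf = band_power(f, pxx, 0.04, 0.15)
hf = band_power(f, pxx, 0.15, 0.40)
lf_hf_ratio = lf / hf if hf > 1e-10 else float("nan")
print(lf, hf, lf_hf_ratio)
```

Because the LF tone has twice the amplitude of the HF tone, its power is about four times larger, so the ratio should land near 4.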

### Statistical Features
Simple statistical measures of the ECG signal can also provide valuable information:
- **Skewness**: Indicates the asymmetry of the ECG signal distribution. An abnormal skewness could suggest alterations in the ECG waveform.
- **Kurtosis**: Measures the 'tailedness' of the signal distribution. Extreme values might indicate anomalies in the ECG waveform.
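
To illustrate how these two statistics react to waveform shape, the sketch below compares Gaussian noise with the same noise plus sparse tall spikes. The signals are synthetic stand-ins, not ECG data, and the spike height and spacing are arbitrary assumptions. `scipy.stats.kurtosis` returns excess (Fisher) kurtosis by default, so a Gaussian signal scores close to 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Symmetric, Gaussian-like baseline
baseline = rng.normal(0.0, 1.0, 5000)

# Same baseline with sparse tall positive spikes (loosely reminiscent of R-peaks)
spiky = baseline.copy()
spiky[::250] += 15.0

summary = {}
for name, x in {"baseline": baseline, "spiky": spiky}.items():
    summary[name] = {
        "Skewness": float(stats.skew(x)),
        "Kurtosis": float(stats.kurtosis(x)),  # excess kurtosis (Fisher definition)
    }
    print(name, summary[name])
```

The spikes pull the distribution's right tail outwards, which raises both skewness and kurtosis relative to the baseline.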

## Exclusion of HRV Features in AF Classification

Heart Rate Variability (HRV) measures the variation in time intervals between heartbeats and is widely used in cardiac health analysis. In the context of the project "AF Classification from a Short Single Lead ECG Recording," however, we exclude dedicated HRV features for several reasons:

1. **Short Recording Duration**: HRV analysis typically requires longer ECG recordings to provide meaningful insights. Since our project deals with short-duration ECG signals, the utility of HRV features is likely to be limited.

2. **Single Lead ECG Limitation**: HRV analysis depends on accurate R-peak timing. With a single lead there is no second channel to cross-check noisy or missed beats, so HRV estimates from these recordings are more sensitive to detection errors.

3. **Specific Focus on Atrial Fibrillation (AF)**: While HRV is valuable for assessing overall cardiac function and autonomic nervous system activity, it is less directly relevant to detecting AF in short-duration ECGs than features such as R-R interval irregularity, frequency domain characteristics, and simple statistical measures of the ECG signal.

4. **Simplifying the Model**: By focusing on a select set of features that are more directly related to AF detection, we aim to develop a model that is both efficient and effective. Incorporating HRV might introduce complexity without significantly improving the model's performance for this specific application.
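
The first consideration above can be made concrete with a little arithmetic: a recording of duration T seconds cannot resolve frequencies finer than 1/T Hz, and it contains only T × f cycles of a component at frequency f. The durations below are illustrative assumptions, not properties of our dataset, and the helper names are hypothetical.

```python
def frequency_resolution_hz(duration_s):
    """Finest frequency spacing resolvable from a recording of the given length."""
    return 1.0 / duration_s

def cycles_observed(duration_s, freq_hz):
    """Number of full or partial cycles of a freq_hz component in the recording."""
    return duration_s * freq_hz

lf_lower_edge_hz = 0.04  # lower edge of the LF band

for duration_s in (10, 30, 60, 300):  # illustrative recording lengths in seconds
    res = frequency_resolution_hz(duration_s)
    cyc = cycles_observed(duration_s, lf_lower_edge_hz)
    print(f"{duration_s:>4} s: resolution {res:.4f} Hz, "
          f"{cyc:.1f} cycles at {lf_lower_edge_hz} Hz")
```

A 10 s recording, for example, holds less than half a cycle of a 0.04 Hz oscillation, which is why slow HRV components are hard to estimate from short segments.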

In summary, excluding HRV features is a deliberate choice tailored to the specific requirements and constraints of the project. This approach aims to optimize the model's performance in detecting AF from short, single-lead ECG recordings.

In [8]:
import neurokit2 as nk
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from tqdm import tqdm

import warnings
warnings.filterwarnings('ignore')

In [3]:
from src.data.load_data import load_dataset, load_ecg, parse_header
from src.ecg.viz import plot_ecg
from src.config import TRAINING_DIR

dataset = load_dataset(TRAINING_DIR)

100%|██████████| 8528/8528 [00:00<00:00, 490450.21it/s]


In [4]:
from src.preprocessing.ecg_preprocessing import ECGPreprocessor

ecg_preprocessor = ECGPreprocessor(window_size=5000, overlap_size=1000)
processed_dataset = ecg_preprocessor.transform(dataset)

Preprocessing ECG signals: 100%|██████████| 8528/8528 [00:11<00:00, 755.59it/s]


In [5]:
import numpy as np
import scipy.signal as sig
from scipy import stats

def extract_features(ecg_signal, fs):
    # ECG processing to find R-peaks and segment the signal
    _, info = nk.ecg_process(ecg_signal, sampling_rate=fs)
    
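    # R-R Interval features
    rri = np.diff(info['ECG_R_Peaks']) / fs * 1000  # R-R intervals in ms
    rri_diff = np.abs(np.diff(rri))  # successive R-R interval changes in ms
    rr_features = {
        'RR_mean': np.mean(rri),
        'RR_std': np.std(rri),
        # Proportion of successive changes above 50 ms (NaN if fewer than two intervals)
        'Irregularity_index': np.mean(rri_diff > 50) if len(rri_diff) > 0 else np.nan
    }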
    
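    # Frequency Domain Features
    # One Welch segment spanning the signal: the default nperseg=256 is too coarse for bands below 0.4 Hz
    f, Pxx = sig.welch(ecg_signal, fs=fs, nperseg=len(ecg_signal))
    lf_band = (f >= 0.04) & (f <= 0.15)
    hf_band = (f > 0.15) & (f <= 0.4)
    lf = np.trapz(Pxx[lf_band], f[lf_band])  # Low frequency power, integrated over Hz
    hf = np.trapz(Pxx[hf_band], f[hf_band])  # High frequency power, integrated over Hz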
    freq_features = {
        'LF': lf,
        'HF': hf,
        'LF_HF_ratio': lf / hf if hf > 1e-10 else np.nan
    }

    # Statistical Features
    stat_features = {
        'Skewness': stats.skew(ecg_signal),
        'Kurtosis': stats.kurtosis(ecg_signal)
    }

    # Combine all features
    features = {**rr_features, **freq_features, **stat_features}
    
    return features

In [9]:
features = []

for data_dict in tqdm(processed_dataset, desc='Extracting features', total=len(processed_dataset)):
    ecg_signal = data_dict['ecg_signal']
    patient_id = data_dict['patient_id']
    label = data_dict['label']
    
    header = parse_header(data_dict['hea_file'])
    sampling_rate = header['sample_rate']
    
    try:
        feature_dict = extract_features(ecg_signal, sampling_rate)
        features.append(feature_dict)
    except ZeroDivisionError:
        # Skip segments that trigger a division by zero during feature extraction
        continue
    except Exception as e:
        # Known ValueErrors raised for unusable segments are skipped
        skippable = (
            "NeuroKit error: the window cannot contain more data points than the time series. Decrease 'scale'.",
            "cannot convert float NaN to integer",
        )
        if isinstance(e, ValueError) and str(e) in skippable:
            continue
        # Anything else is unexpected: plot the offending segment, then re-raise
        plt.plot(ecg_signal)
        plt.title(f'Patient {patient_id} ECG {label}')
        plt.show()
        raise

Extracting features: 100%|██████████| 17613/17613 [22:47<00:00, 12.88it/s]
