## Imports and Configuration

In [1]:
import os
import sys
import shutil
import time
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from tqdm import tqdm

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from config.load_configuration import load_configuration

#### Loading configuration

This notebook loads configuration settings using the `load_configuration` function from the `config.load_configuration` module. The configuration is stored in the `config` variable.

In [2]:
config = load_configuration()

PC Name: DESKTOP-LUKAS
Loaded configuration from ../config/config_lukas.yaml
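
The cells in this notebook read the keys `seed`, `path_to_matlab_data`, and `path_to_data` from `config`. The keys `normalization_method` and `number_of_augmentations` are read only by the disabled or optional code paths. A minimal sketch of an early check follows. It assumes `config` behaves like a dictionary; `REQUIRED_KEYS`, `OPTIONAL_KEYS`, and `check_config` are hypothetical names and not part of the `config` module, and the example values are placeholders.

```python
# Hypothetical helper: fail early if a key used later in the notebook is missing.
REQUIRED_KEYS = ("seed", "path_to_matlab_data", "path_to_data")
OPTIONAL_KEYS = ("normalization_method", "number_of_augmentations")


def check_config(config):
    """Raise if a required key is missing; return the optional keys that are absent."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"Missing configuration keys: {missing}")
    return [key for key in OPTIONAL_KEYS if key not in config]


# Placeholder values for illustration only, not the real configuration.
example_config = {"seed": 42, "path_to_matlab_data": "in", "path_to_data": "out"}
absent_optional = check_config(example_config)
```

Running such a check right after `load_configuration()` would surface a missing path before the multi-minute loading step starts.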


#### Setting random seed

The random seed is set using `np.random.seed(config['seed'])` to ensure reproducibility of results throughout the data processing workflow.

In [3]:
np.random.seed(config['seed'])
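
The effect of seeding can be checked in isolation. The sketch below uses an arbitrary example seed (`123`, not the value from the configuration) and shows that reseeding NumPy's global generator restarts the same sequence of draws.

```python
import numpy as np

np.random.seed(123)             # 123 is an arbitrary example value
first_run = np.random.rand(3)

np.random.seed(123)             # reseeding restarts the same sequence
second_run = np.random.rand(3)

# True when both runs produced identical numbers
same_sequence = np.array_equal(first_run, second_run)
```

The seed fixes only the sequence of random numbers. Which file receives which number also depends on the order in which the files are listed, and `os.listdir` returns entries in arbitrary order.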

#### Data Containers and Loading

- `data_with_features_train`, `data_with_features_test`, and `data_with_features_validation` are lists used to store processed ECG data for training, testing, and validation.
- All files in the directory specified by `config['path_to_matlab_data']` are read with `pd.read_csv` and appended, one DataFrame per file, to the list `features_by_ecg_id` for further processing.

In [4]:
# Data containers
data_with_features_train = []
data_with_features_test = []
data_with_features_validation = []

# Load all files in the directory
files = os.listdir(config['path_to_matlab_data'])
print("Found " + str(len(files)) + " files in the directory: " + config['path_to_matlab_data'])

# Read all files and store in a list
features_by_ecg_id = []
for file_name in tqdm(files, desc="Preloading data...", unit="file"):
    features_by_ecg_id.append(pd.read_csv(os.path.join(config['path_to_matlab_data'], file_name)))

Found 21754 files in the directory: C:\Users\lukas\Documents\HKA_DEV\HKA_EKG_Signalverarbeitung_Data\ptb-xl-a-large-publicly-available-electrocardiography-dataset-1.0.3\preprocessing_output


Preloading data...: 100%|██████████| 21754/21754 [07:48<00:00, 46.41file/s]
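
Because `os.listdir` returns entries in arbitrary order, a sorted listing makes the file order, and with it the later random split, repeatable across machines. The sketch below is one way to do this; `list_csv_files` is a hypothetical helper, and the `.csv` filter assumes the preprocessing output uses that extension, which the notebook does not state.

```python
import os
import tempfile


def list_csv_files(directory):
    """Return full paths of the .csv files in `directory`, sorted by file name."""
    names = sorted(
        name for name in os.listdir(directory) if name.lower().endswith(".csv")
    )
    return [os.path.join(directory, name) for name in names]


# Small self-contained demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.csv", "a.csv", "notes.txt"):
        with open(os.path.join(tmp, name), "w") as handle:
            handle.write("raw_data\n0\n")
    found = [os.path.basename(path) for path in list_csv_files(tmp)]
```

Each returned path could then be passed to `pd.read_csv` as in the loading loop.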


#### Median Lead Calculation (disabled), Normalization (disabled), and Train/Test/Validation Split

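Median-lead extraction and normalization (Z-score or Min-Max, selected by `config['normalization_method']`) are currently commented out in the cell below, so each file's DataFrame is passed through unchanged. The only active step is the split: for each file, one uniform random number is drawn and the file is assigned to the training, test, or validation set with probability 0.70, 0.15, and 0.15. Because the assignment is made per file, the resulting set sizes only approximate a 70/15/15 split.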
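
The two normalization formulas named here can be sketched as standalone functions. This is an illustration under stated assumptions, not the notebook's active code: `z_score` and `min_max` are hypothetical names, and returning zeros for a constant signal is one possible way to avoid a division by zero.

```python
import numpy as np


def z_score(signal):
    """Scale a 1-D signal to zero mean and unit standard deviation."""
    signal = np.asarray(signal, dtype=float)
    std = signal.std()
    if std == 0:                      # constant signal: avoid division by zero
        return np.zeros_like(signal)
    return (signal - signal.mean()) / std


def min_max(signal):
    """Scale a 1-D signal linearly into the range [0, 1]."""
    signal = np.asarray(signal, dtype=float)
    span = signal.max() - signal.min()
    if span == 0:                     # constant signal: avoid division by zero
        return np.zeros_like(signal)
    return (signal - signal.min()) / span


example = np.array([1.0, 2.0, 3.0, 4.0])
z = z_score(example)
m = min_max(example)
```

Z-score output is unbounded but centered on zero, while Min-Max output is bounded and sensitive to single outliers, which is worth weighing for ECG signals with occasional artifacts.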

In [None]:
def plot_raw_and_median_lead(raw_data_row_i, median_lead, i):
    # Plot the raw data and the median data
    plt.figure(figsize=(12, 6))
    plt.plot(raw_data_row_i['raw_data'], label='Raw Data')
    plt.plot(median_lead, label='Median Lead', linestyle='--')
    plt.legend()
    plt.title(f'ECG Data for Sample {i}')
    plt.xlabel('Data Points')
    plt.ylabel('Amplitude')
    plt.show()

print("Processing files by calculating the median lead and normalizing the data...")
for i in tqdm(range(0, len(files)), desc="Processing files", unit="file"):
    # Load raw data
    # raw_data_row_i = pd.DataFrame(features_by_ecg_id[i])
    df = pd.DataFrame(features_by_ecg_id[i])

    # Calculate the median lead of 12-lead-ecg
    # median_lead = raw_data_row_i['raw_data']

    # if config['normalization_method'] == "z-score":
    #     # Normalize median lead using Z-score normalization
    #     median_lead = (median_lead - np.mean(median_lead)) / np.std(median_lead)
    # if config['normalization_method'] == "min-max":
    #     # Normalize median lead using Min-Max normalization
    #     median_lead = (median_lead - np.min(median_lead)) / (np.max(median_lead) - np.min(median_lead))

    # plot_raw_and_median_lead(raw_data_row_i, median_lead, i)
    
    # Create a new DataFrame to store the data with features
    # df = raw_data_row_i

    # Replace 'raw-data' with median
    # df['raw_data'] = median_lead

    # Use random number to define if the data is used for training or testing or validation
    # 70% of the data is used for training, 15% for testing and 15% for validation
    random_number = np.random.rand()
    if random_number < 0.7:
        data_with_features_train.append(df)
    elif random_number < 0.85:
        data_with_features_test.append(df)
    else:
        data_with_features_validation.append(df)

Processing files by calculating the median lead and normalizing the data...


Processing files: 100%|██████████| 21754/21754 [00:00<00:00, 83674.99file/s]
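
The per-file draw reaches the target proportions only on average. In the run shown in this notebook, the save step reports 15251, 3219, and 3284 files, which is about 70.1 %, 14.8 %, and 15.1 % of 21754. If exact set sizes are wanted, shuffling the indices once and slicing is a common alternative. The sketch below is such a variant, not the notebook's method; `split_indices` and its parameters are hypothetical.

```python
import numpy as np


def split_indices(n_samples, seed, train_frac=0.70, test_frac=0.15):
    """Shuffle indices once and slice them into train/test/validation parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = int(round(train_frac * n_samples))
    n_test = int(round(test_frac * n_samples))
    train = order[:n_train]
    test = order[n_train:n_train + n_test]
    validation = order[n_train + n_test:]   # remainder goes to validation
    return train, test, validation


# seed=0 is an arbitrary example value
train_idx, test_idx, val_idx = split_indices(21754, seed=0)
```

The index arrays can be used to pick DataFrames from `features_by_ecg_id`, and every index lands in exactly one set.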


#### Data Augmentation and Saving

This cell writes the three sets to disk as CSV files in the folders `pd_dataset_train`, `pd_dataset_test`, and `pd_dataset_val` under `config['path_to_data']`. Existing folders are deleted and recreated first. With `perform_augmentation = True`, `config['number_of_augmentations']` random 512-point segments are extracted from each sample and each segment is saved as its own file, which increases the number of training examples. In the run shown here, `perform_augmentation` is `False`, so every sample is saved unchanged as a single file.
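
The clear-and-recreate step for the output folders can be wrapped in a small helper. The sketch below assumes that silently replacing the folder is acceptable; `reset_directory` is a hypothetical name, and `exist_ok=True` is a design choice that differs from a plain `os.makedirs(path)` call.

```python
import os
import shutil
import tempfile


def reset_directory(path):
    """Delete `path` with everything in it, then recreate it as an empty folder."""
    shutil.rmtree(path, ignore_errors=True)
    os.makedirs(path, exist_ok=True)


# Self-contained demonstration in a temporary location.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "pd_dataset_train")
    os.makedirs(target)
    with open(os.path.join(target, "stale.csv"), "w") as handle:
        handle.write("raw_data\n0\n")
    reset_directory(target)
    remaining = os.listdir(target)
```

Note that `ignore_errors=True` hides a failed deletion, for example when a file is still open in another program. With `exist_ok=True`, stale files could then remain unnoticed, so a stricter variant might drop both flags and let the error surface.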

In [6]:
print("Augmenting and saving data...")

perform_augmentation = False

# Get paths to data folders
path_train = config['path_to_data'] + "/pd_dataset_train"
path_test = config['path_to_data'] + "/pd_dataset_test"
path_val = config['path_to_data'] + "/pd_dataset_val"

# Delete folders and files
shutil.rmtree(path_train, ignore_errors=True)
shutil.rmtree(path_test, ignore_errors=True)
shutil.rmtree(path_val, ignore_errors=True)

# Generate Same structure again
os.makedirs(path_train)
os.makedirs(path_test)
os.makedirs(path_val)

# With augmentation enabled, save config['number_of_augmentations'] segments
# of 512 datapoints for every sample in each of the three sets.
# The start index is drawn at random from [512, len(sample) - 512), so a sample
# must have more than 1024 rows; otherwise np.random.randint raises a ValueError.
if perform_augmentation:
    for i in tqdm(range(0, len(data_with_features_train)), desc="Augmenting train...", unit="sample"):
        for j in range(0, config['number_of_augmentations']):
            # Select a random starting point
            start_idx = np.random.randint(512, len(data_with_features_train[i]) - 512)
            # Extract 512 datapoints
            pd_dataset_train = data_with_features_train[i].iloc[start_idx:start_idx + 512]
            # Save the data to a file
            pd_dataset_train.to_csv(path_train + "/" + str(i) + "_" + str(j) + ".csv", index=False)

    for i in tqdm(range(0, len(data_with_features_test)), desc="Augmenting test...", unit="sample"):
        for j in range(0, config['number_of_augmentations']):
            start_idx = np.random.randint(512, len(data_with_features_test[i]) - 512)
            pd_dataset_test = data_with_features_test[i].iloc[start_idx:start_idx + 512]
            pd_dataset_test.to_csv(path_test + "/" + str(i) + "_" + str(j) + ".csv", index=False)

    for i in tqdm(range(0, len(data_with_features_validation)), desc="Augmenting validation...", unit="sample"):
        for j in range(0, config['number_of_augmentations']):
            start_idx = np.random.randint(512, len(data_with_features_validation[i]) - 512)
            pd_dataset_validation = data_with_features_validation[i].iloc[start_idx:start_idx + 512]
            pd_dataset_validation.to_csv(path_val + "/" + str(i) + "_" + str(j) + ".csv", index=False)

else:
    for i in tqdm(range(0, len(data_with_features_train)), desc="Saving train...", unit="sample"):
        data_with_features_train[i].to_csv(path_train + "/" + str(i) + ".csv", index=False)

    for i in tqdm(range(0, len(data_with_features_test)), desc="Saving test...", unit="sample"):
        data_with_features_test[i].to_csv(path_test + "/" + str(i) + ".csv", index=False)

    for i in tqdm(range(0, len(data_with_features_validation)), desc="Saving validation...", unit="sample"):
        data_with_features_validation[i].to_csv(path_val + "/" + str(i) + ".csv", index=False)

print("Data Preprocessing finished!")

Augmenting and saving data...


Saving train...: 100%|██████████| 15251/15251 [11:14<00:00, 22.60sample/s]
Saving test...: 100%|██████████| 3219/3219 [02:27<00:00, 21.82sample/s]
Saving validation...: 100%|██████████| 3284/3284 [02:29<00:00, 21.96sample/s]

Data Preprocessing finished!
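
The augmentation branch was not executed in this run. Its window selection can be sketched separately, with a guard for records that are too short to hold a segment. The sketch mirrors the start range used in the cell, `[512, len - 512)`, but `random_window_starts` and its parameters are hypothetical, and it uses `np.random.default_rng` instead of the global seed.

```python
import numpy as np


def random_window_starts(n_rows, n_windows, window=512, skip=512, rng=None):
    """Draw start indices so that each window of `window` rows fits in the record.

    The first `skip` rows are never used as a start, as in the notebook cell.
    Returns an empty array when the record is too short.
    """
    rng = np.random.default_rng() if rng is None else rng
    low = skip
    high = n_rows - window            # exclusive upper bound for the start index
    if high <= low:
        return np.array([], dtype=int)
    return rng.integers(low, high, size=n_windows)


# 5000 and 1000 rows are example record lengths; seed 0 is arbitrary.
starts = random_window_starts(5000, n_windows=5, rng=np.random.default_rng(0))
too_short = random_window_starts(1000, n_windows=5)
```

Each start index `s` would then select `sample.iloc[s:s + 512]`. Returning an empty array lets the caller skip short records instead of stopping on a `ValueError`.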



