# Data Preprocessing for ICENTIA11K Dataset

After exploring the dataset in the [Explore ICENTIA11K Dataset Notebook](./explore_ICENTIA11K_dataset.ipynb), the next step is to preprocess the data. As a reminder, both the samples and the labels come from each record's `p_signal`, which is an `np.ndarray`.

## Preprocessing Steps:

1. **Transform to Tensor:**
   - Convert the `np.ndarray` samples to tensors.

2. **Divide into Features and Labels:**
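   - Split each example into features (X) and labels (y). The initial split ratio is 9:1; in the windowing cell of this notebook that corresponds to a 9-minute context window followed by a 1-minute label window.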

3. **Adjusting Sample Length:**
   - To keep examples consistent, fix the length of each sample. Each file is split into as many fixed-length examples as possible, and any remainder is discarded. The length is an adjustable parameter of the script.

4. **Saving Data as Tensors:**
   - For time efficiency, save the preprocessed data as tensors. This helps in quick loading and further analysis without the need for repetitive preprocessing.
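
The windowing and the 9:1 split described above can be sketched with plain NumPy. This is a minimal illustration, not the notebook's actual implementation: the helper names `make_windows` and `split_context_label` are hypothetical, and the toy sizes are chosen only to keep the output readable.

```python
import numpy as np

def make_windows(signal, window_size):
    """Cut a 1-D signal into non-overlapping windows, dropping the remainder."""
    num_windows = len(signal) // window_size
    return signal[:num_windows * window_size].reshape(num_windows, window_size)

def split_context_label(windows, context_size):
    """Split every window into features X (first part) and labels y (the rest)."""
    return windows[:, :context_size], windows[:, context_size:]

# Toy stand-in for one record's p_signal (25 samples instead of hours of ECG)
signal = np.arange(25, dtype=np.float32)

windows = make_windows(signal, window_size=10)        # 2 windows, 5 samples dropped
X, y = split_context_label(windows, context_size=9)   # 9:1 split inside each window

print(windows.shape, X.shape, y.shape)  # (2, 10) (2, 9) (2, 1)
```

If PyTorch is used downstream, `torch.from_numpy(X)` converts such an array to a tensor without copying the data.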

### Implementation Details:

The code in this notebook implements these steps. The sample length is a parameter that can be tuned through experimentation. The script splits files into fixed-length examples at runtime, without creating new files.

This preprocessing stage ensures that the data is in a suitable format for training machine learning models. Adjusting the sample length allows for flexibility in model training and experimentation.



Use the functions in [SSSD-main/.../timeseries_utils.py](SSSD-main/docs/instructions/PTB-XL/clinical_ts/timeseries_utils.py) to preprocess:

1. `ToTensor(object)`, whose docstring reads "Convert ndarrays in sample to Tensors."
2. `class TimeseriesDatasetCrops(torch.utils.data.Dataset)`

In [2]:
import sys
import os
import h5py
import tqdm

sys.path.append('..')  # Add the parent directory to the sys.path

import utils.data_preparation as data_preparation

os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '6'

! gpustat

subset_data_dir = "/home/liranc6/ecg/ecg_forecasting/data/icentia11k-continuous-ecg_normal_sinus_subset/"  # patients 0-8

[1m[37mrambo5                       [m  Sat Dec  9 10:55:46 2023  [1m[30m525.116.04[m
[36m[0][m [34mNVIDIA GeForce RTX 2080 Ti[m |[31m 25°C[m, [32m  0 %[m | [1m[33m    1[m / [33m11264[m MB |
[36m[1][m [34mNVIDIA GeForce RTX 2080 Ti[m |[31m 24°C[m, [32m  0 %[m | [1m[33m    1[m / [33m11264[m MB |
[36m[2][m [34mNVIDIA GeForce RTX 2080 Ti[m |[31m 26°C[m, [32m  0 %[m | [1m[33m    1[m / [33m11264[m MB |
[36m[3][m [34mNVIDIA GeForce RTX 2080 Ti[m |[31m 26°C[m, [32m  0 %[m | [1m[33m    1[m / [33m11264[m MB |
[36m[4][m [34mNVIDIA GeForce RTX 2080 Ti[m |[31m 24°C[m, [32m  0 %[m | [1m[33m    1[m / [33m11264[m MB |
[36m[5][m [34mNVIDIA GeForce RTX 2080 Ti[m |[31m 25°C[m, [32m  0 %[m | [1m[33m    1[m / [33m11264[m MB |
[36m[6][m [34mNVIDIA GeForce RTX 2080 Ti[m |[31m 25°C[m, [32m  0 %[m | [1m[33m    1[m / [33m11264[m MB |
[36m[7][m [34mNVIDIA GeForce RTX 2080 Ti[m |[31m 26°C[m, [32m  0 %

In [3]:
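def dir_tree(path, level=0):
    """Print the directory tree under `path`, indented by depth."""
    for root, _, files in os.walk(path):
        # Indent by the depth of `root` relative to `path`, so that sibling
        # directories share the same indentation
        rel = os.path.relpath(root, path)
        depth = 0 if rel == '.' else rel.count(os.sep) + 1
        indent = ' ' * (level + depth)
        print(f'{indent}{root}')
        for file in files:
            print(f'{indent}└─{file}')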

dir_tree(subset_data_dir)

/home/liranc6/ecg/ecg_forecasting/data/icentia11k-continuous-ecg_normal_sinus_subset/
 /home/liranc6/ecg/ecg_forecasting/data/icentia11k-continuous-ecg_normal_sinus_subset/p00
  /home/liranc6/ecg/ecg_forecasting/data/icentia11k-continuous-ecg_normal_sinus_subset/p00/p00000
  └─p00000_s00_128028_to_281500.hea
  └─p00000_s00_128028_to_281500.dat
  └─p00000_s00_128028_to_281500.atr
  └─p00000_s00_354672_to_563455.hea
  └─p00000_s00_354672_to_563455.dat
  └─p00000_s00_354672_to_563455.atr
  └─p00000_s00_807626_to_1048448.hea
  └─p00000_s00_807626_to_1048448.dat
  └─p00000_s00_807626_to_1048448.atr
  └─p00000_s01_844645_to_1024432.hea
  └─p00000_s01_844645_to_1024432.dat
  └─p00000_s01_844645_to_1024432.atr
  └─p00000_s02_116787_to_276505.hea
  └─p00000_s02_116787_to_276505.dat
  └─p00000_s02_116787_to_276505.atr
  └─p00000_s02_278461_to_507274.hea
  └─p00000_s02_278461_to_507274.dat
  └─p00000_s02_278461_to_507274.atr
  └─p00000_s02_576350_to_797060.hea
  └─p00000_s02_576350_to_797060.dat


In [4]:
# PhysioNet records to np.ndarray
pSignal_npArray_data_dir_h5 = '/home/liranc6/ecg/ecg_forecasting/data/icentia11k-continuous-ecg_normal_sinus_subset_npArrays.h5'

# data_preparation.extract_and_save_p_signal_to_HDF5(subset_data_dir, pSignal_npArray_data_dir_h5)

In [5]:
data_preparation.print_h5_hierarchy(pSignal_npArray_data_dir_h5)

Group: p00
  Group: p00000
    Dataset: p00000_s00_128028_to_281500_p_signal
    Dataset: p00000_s00_354672_to_563455_p_signal
    Dataset: p00000_s00_807626_to_1048448_p_signal
    Dataset: p00000_s01_844645_to_1024432_p_signal
    Dataset: p00000_s02_116787_to_276505_p_signal
    Dataset: p00000_s02_278461_to_507274_p_signal
    Dataset: p00000_s02_576350_to_797060_p_signal
    Dataset: p00000_s02_801926_to_1048489_p_signal
    Dataset: p00000_s03_132_to_486821_p_signal
    Dataset: p00000_s03_489344_to_798829_p_signal
    Dataset: p00000_s03_801366_to_1021262_p_signal
    Dataset: p00000_s04_199_to_389438_p_signal
    Dataset: p00000_s04_505855_to_1048392_p_signal
    Dataset: p00000_s05_19_to_439676_p_signal
    Dataset: p00000_s05_441869_to_1025683_p_signal
    Dataset: p00000_s06_143423_to_405778_p_signal
    Dataset: p00000_s06_406947_to_696003_p_signal
    Dataset: p00000_s06_697652_to_922823_p_signal
    Dataset: p00000_s07_120981_to_493221_p_signal
    Dataset: p00000_s07_494

In [6]:
# split the arrays to fixed size windows
fs = 250
context_window_size = 9*60*fs  # minutes * seconds * fs
label_window_size = 1*60*fs  # minutes * seconds * fs
window_size = context_window_size+label_window_size


split_pSignal_file = '/home/liranc6/ecg/ecg_forecasting/data/icentia11k-continuous-ecg_normal_sinus_subset_npArrays_splits/10minutes_window.h5'

base_name, extension = os.path.splitext(os.path.basename(split_pSignal_file))
new_base_name = f"{base_name}_temp{extension}"
temp_filename = os.path.join(os.path.dirname(split_pSignal_file), new_base_name)
data_preparation.split_and_save_data(pSignal_npArray_data_dir_h5, window_size, temp_filename)
data_preparation.merge_datasets(temp_filename, split_pSignal_file)

Processing p00000_s00_128028_to_281500_p_signal: 100%|██████████| 1/1 [00:00<00:00, 341.81window/s]
Processing p00000_s00_354672_to_563455_p_signal: 100%|██████████| 1/1 [00:00<00:00, 415.32window/s]
Processing p00000_s00_807626_to_1048448_p_signal: 100%|██████████| 1/1 [00:00<00:00, 364.60window/s]
Processing p00000_s01_844645_to_1024432_p_signal: 100%|██████████| 1/1 [00:00<00:00, 626.11window/s]
Processing p00000_s02_116787_to_276505_p_signal: 100%|██████████| 1/1 [00:00<00:00, 408.44window/s]
Processing p00000_s02_278461_to_507274_p_signal: 100%|██████████| 1/1 [00:00<00:00, 350.05window/s]
Processing p00000_s02_576350_to_797060_p_signal: 100%|██████████| 1/1 [00:00<00:00, 352.31window/s]
Processing p00000_s02_801926_to_1048489_p_signal: 100%|██████████| 1/1 [00:00<00:00, 564.21window/s]
Processing p00000_s03_132_to_486821_p_signal: 100%|██████████| 3/3 [00:00<00:00, 595.25window/s]
Processing p00000_s03_489344_to_798829_p_signal: 100%|██████████| 2/2 [00:00<00:00, 237.82window/s]
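
The window counts in the progress bars above follow from integer division of each record's length by `window_size`. The sketch below reproduces that arithmetic under one assumption that the notebook does not state: that the `<start>_to_<end>` part of a dataset name gives the record's length in samples. The helper `expected_windows` is hypothetical.

```python
import re

fs = 250                             # sampling rate used in the notebook
window_size = (9 + 1) * 60 * fs      # 9 min context + 1 min label = 150000 samples

def expected_windows(dataset_name, window_size):
    """Estimate the number of full windows from the sample range in the name.

    Assumption: '<start>_to_<end>' encodes the record length in samples.
    """
    start, end = map(int, re.search(r"_(\d+)_to_(\d+)", dataset_name).groups())
    return (end - start) // window_size

for name in [
    "p00000_s00_128028_to_281500_p_signal",
    "p00000_s03_132_to_486821_p_signal",
    "p00000_s03_489344_to_798829_p_signal",
]:
    print(name, expected_windows(name, window_size))
```

For these three names the estimate gives 1, 3, and 2 windows, which agrees with the totals shown in the corresponding progress bars.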


In [17]:
from tqdm import tqdm  # the earlier cell imports the tqdm module; the callable is needed here

def split_and_save_data(input_h5_file, window_size, output_h5_file):
    """
    Split each dataset in the input HDF5 file into windows of the specified size
    and save the resulting windows into an output HDF5 file.

    :param input_h5_file: The input HDF5 file with datasets to split.
    :param window_size: The size of each window.
    :param output_h5_file: The output HDF5 file to save the split data.
    :return: None
    """
    # Create the output directory if it doesn't exist
    os.makedirs(os.path.dirname(output_h5_file), exist_ok=True)

    with h5py.File(input_h5_file, 'r') as input_file, h5py.File(output_h5_file, 'w') as output_file:
        # Define a recursive function to process groups and datasets
        def process_group(input_group, output_group):
            for name, item in input_group.items():
                if isinstance(item, h5py.Group):
                    # Placeholder: mirror every group in the output and recurse.
                    # TODO: create only leaf groups
                    process_group(item, output_group.require_group(name))

                elif isinstance(item, h5py.Dataset):
                    # Split the dataset into windows, dropping any remainder
                    data = item[:]
                    num_windows = len(data) // window_size

                    for i in tqdm(range(num_windows), desc=f"Processing {name}", unit="window"):
                        window_data = data[i * window_size: (i + 1) * window_size]
                        # Placeholder for the original TODO: store each window as
                        # its own dataset (the naming scheme is an assumption)
                        output_group.create_dataset(f"{name}_window_{i}", data=window_data)

        # Start processing from the root group
        process_group(input_file, output_file)


In [58]:
import h5py
import numpy as np
import os
from tqdm import tqdm

def split_and_save_data(input_h5_file, window_size, output_h5_file):
    """
    Split each dataset in the input HDF5 file into windows of the specified size
    and save the resulting windows into an output HDF5 file.

    :param input_h5_file: The input HDF5 file with datasets to split.
    :param window_size: The size of each window.
    :param output_h5_file: The output HDF5 file to save the split data.
    :return: None
    """
    
    def extract_integers(text):
        """
        Extract integers from the given text.
    
        :param text: The input text containing characters and integers.
        :return: A string containing only the integers found in the text.
        """
        return ''.join(filter(str.isdigit, str(text)))

    # Create the output directory if it doesn't exist
    os.makedirs(os.path.dirname(output_h5_file), exist_ok=True)

    with h5py.File(input_h5_file, 'r') as input_file, h5py.File(output_h5_file, 'w') as output_file:
        for group_name, group_item in input_file.items():
            # Check the items, not their names (names are always plain strings)
            assert isinstance(group_item, h5py.Group), "expected a top-level group"
            for subgroup_name, subgroup_item in tqdm(group_item.items(), desc="Processing Subgroups", unit="subgroup"):
                print(f"subgroup_name: {subgroup_name}")
                assert isinstance(subgroup_item, h5py.Group), "expected a leaf group"
                dataset_data = []
                for dataset_name, dataset_item in tqdm(subgroup_item.items(), desc="Processing datasets"):
                    assert isinstance(dataset_item, h5py.Dataset), "expected a dataset"
                    # Split the dataset into windows, dropping any remainder
                    data = dataset_item[:]
                    num_windows = len(data) // window_size

                    # Collect every window of this subgroup in one list
                    for i in range(num_windows):
                        window_data = data[i * window_size: (i + 1) * window_size]
                        dataset_data.append(window_data)

                # One output dataset per subgroup, named by the digits in its name
                output_name = extract_integers(subgroup_name)
                output_file.create_dataset(output_name, data=np.array(dataset_data))
                            

In [78]:
import h5py
import numpy as np
import os
from tqdm import tqdm

def split_and_save_data(input_h5_file, window_size, output_h5_file):
    """
    Split each dataset in the input HDF5 file into windows of the specified size
    and save the resulting windows into an output HDF5 file.

    :param input_h5_file: The input HDF5 file with datasets to split.
    :param window_size: The size of each window.
    :param output_h5_file: The output HDF5 file to save the split data.
    :return: None
    """
    
    def extract_integers(text):
        """
        Extract integers from the given text.
    
        :param text: The input text containing characters and integers.
        :return: A string containing only the integers found in the text.
        """
        return ''.join(filter(str.isdigit, str(text)))

    # Create the output directory if it doesn't exist
    os.makedirs(os.path.dirname(output_h5_file), exist_ok=True)

    with h5py.File(input_h5_file, 'r') as input_file, h5py.File(output_h5_file, 'w') as output_file:
        # First pass: count the datasets so the progress bar has a total
        total_leaf_iterations = 0
        for group_name, group_item in input_file.items():
            # Check the items, not their names (names are always plain strings)
            assert isinstance(group_item, h5py.Group), "expected a top-level group"
            for subgroup_name, subgroup_item in tqdm(group_item.items(), desc="Processing Subgroups", unit="subgroup"):
                assert isinstance(subgroup_item, h5py.Group), "expected a leaf group"
                total_leaf_iterations += len(subgroup_item)

        # Second pass: split every dataset into windows
        progress_bar = tqdm(total=total_leaf_iterations, position=0, leave=False, desc='Processing')
        for group_name, group_item in input_file.items():
            for subgroup_name, subgroup_item in group_item.items():
                dataset_data = []
                for dataset_name, dataset_item in subgroup_item.items():
                    assert isinstance(dataset_item, h5py.Dataset), "expected a dataset"
                    # Split the dataset into windows, dropping any remainder
                    data = dataset_item[:]
                    num_windows = len(data) // window_size

                    # Collect every window of this subgroup in one list
                    for i in range(num_windows):
                        window_data = data[i * window_size: (i + 1) * window_size]
                        dataset_data.append(window_data)

                    progress_bar.update(1)

                # One output dataset per subgroup, named by the digits in its name
                output_name = extract_integers(subgroup_name)
                output_file.create_dataset(output_name, data=np.array(dataset_data))
        progress_bar.close()

In [79]:
split_and_save_data(pSignal_npArray_data_dir_h5, window_size, temp_filename)
data_preparation.print_h5_hierarchy(temp_filename)
# data_preparation.count_items(split_pSignal_file)


Processing Subgroups:   0%|          | 0/9 [00:00<?, ?subgroup/s]
Processing Subgroups: 100%|██████████| 9/9 [00:00<00:00, 72.24subgroup/s]

Dataset: 00000
Dataset: 00001
Dataset: 00002
Dataset: 00003
Dataset: 00004
Dataset: 00005
Dataset: 00006
Dataset: 00007
Dataset: 00008
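
The dataset names in this listing come from `extract_integers`, which keeps every digit of the subgroup name. A small standalone copy of that helper shows the mapping; the second input is a made-up name, included only to show that digits from separate fields are concatenated.

```python
def extract_integers(text):
    """Return a string made of all digit characters found in `text`."""
    return ''.join(filter(str.isdigit, str(text)))

print(extract_integers("p00000"))      # 00000
print(extract_integers("p00000_s03"))  # 0000003 (digits of both fields run together)
```

This is harmless for subgroup names such as `p00000`, but it would merge fields if the helper were applied to names that contain more than one number.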


