# QRS Training Data Preprocessor

Data preprocessor that builds the training and validation data for our neural network.

This notebook implements all the steps needed to create the training and test data files.

## Importing libraries

In [1]:
import wfdb
import numpy as np
import pickle as pkl
import matplotlib.pyplot as plt
import traceback

## Creating the parabola function

This function creates a parabola around each spike to give it more width, so that it can be detected more easily.

In [3]:
# auxiliary function
def parabola(a,n,r):
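    """
    Creates a parabola around each spike position specified in 'a'.
    Params:
        a - A vector of integer peak positions (sample indices)
        n - The length of the target vector to generate
        r - The radius of the parabola; each peak fills at most 2*r-1 samples
    Returns:
        A float32 vector of length n that is zero everywhere except for
        a parabola peaking at 1 around each position
    """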
    assert n>2*r
    y = np.zeros(n, dtype = np.float32)
    x= np.array(range(2,2*r+1))
    for i in a:
        if i > r-1 and i <= n-r:
            y[i-r+1:i+r] = ((r+1)**2-(x-r-1)**2)/(r+1)**2
        elif i < r:
            y[:i+r] = ((r+1)**2-(x[r-i-1:]-r-1)**2)/(r+1)**2
        elif i<n:
            y[i-r+1:] = ((r+1)**2-(x[:r-1+(n-i)]-r-1)**2)/(r+1)**2
    return y
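
To make the shape concrete, the following minimal sketch recomputes the values that `parabola` writes around a single interior peak. The helper name `parabola_profile` is hypothetical and not part of the notebook; it only restates the expression above in terms of the offset `k` from the peak.

```python
import numpy as np

def parabola_profile(r):
    """Hypothetical helper: values written around one interior peak for radius r."""
    # Offsets from the peak: -(r-1), ..., 0, ..., r-1 (2*r - 1 samples in total)
    k = np.arange(-(r - 1), r, dtype=np.float64)
    # Same expression as in parabola(), rewritten with k = x - r - 1
    return (((r + 1) ** 2 - k ** 2) / (r + 1) ** 2).astype(np.float32)

profile = parabola_profile(3)
print(profile.tolist())  # [0.75, 0.9375, 1.0, 0.9375, 0.75]
```

For `r = 3`, the value passed to `parabola` in the preprocessing loops, each interior peak therefore covers five samples and never drops below 0.75 inside that window.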

## Preprocessing the files

We iterate over all the record files and, for each of them, read channels II and V1.

After reading the channels we separate them into two distinct arrays and then combine them into a single 1D array by adding them sample by sample.
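
The preprocessing cells combine the channels with `signal_II + signal_V1`. As a small illustration with made-up sample values (not taken from any record), the sketch below shows that `+` on two NumPy arrays adds them element by element and keeps the original length, whereas `np.concatenate` would place one channel after the other.

```python
import numpy as np

# Toy two-channel signal with made-up values: one row per sample, one column per channel
signal = np.array([[0.1, 1.0],
                   [0.2, 2.0],
                   [0.3, 3.0]])
signal_II = signal[:, 0]
signal_V1 = signal[:, 1]

summed = signal_II + signal_V1                     # element-wise sum, shape (3,)
stacked = np.concatenate([signal_II, signal_V1])   # end to end, shape (6,)

print(summed.shape, stacked.shape)  # (3,) (6,)
```

Only the summed form has the same length as the label vector, which is built with `len(signal)` samples.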

We filter out undesired annotations, i.e., annotations whose symbol is not one of the QRS symbols listed in `qrs_symbs`.

Negative positions in the resulting `qrs_symbol_positions` list are also filtered out.
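
The two filtering steps can be sketched on a handful of made-up annotations. The positions and symbols below are invented for illustration, and `qrs_symbs` is shortened; the sketch also merges both filters into one pass, which is a simplification of the two separate list comprehensions used in the notebook.

```python
# Made-up annotation data for illustration (not taken from any record)
symbol_positions = [-5, 10, 50, 120, 200]
symbol_list = ['N', '+', 'N', '~', 'V']

# Shortened symbol list for the example; the notebook uses a longer one
qrs_symbs = ['N', 'V', 'A']

# Keep positions whose symbol is a QRS symbol and whose sample index is non-negative
qrs_symbol_positions = [
    position
    for position, symb in zip(symbol_positions, symbol_list)
    if symb in qrs_symbs and position >= 0
]
print(qrs_symbol_positions)  # [50, 200]
```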

Once the annotations are filtered, we create a parabola around each spike in our labels to make them easier to identify. This completes the preprocessing of the data, and we are ready to create an output dictionary that is serialized into a file with `pickle`.
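
As a minimal sketch of the serialization step, the example below writes a dictionary with the same two keys to a temporary file and reads it back. The array contents and the temporary path are placeholders chosen for the example, not values from the dataset.

```python
import os
import pickle as pkl
import tempfile

import numpy as np

# Placeholder data with the same keys as the notebook's output dictionary
output_dict = {
    "features": np.array([0.0, 0.5, 1.5, 0.5, 0.0]),
    "label": np.array([0.0, 0.9375, 1.0, 0.9375, 0.0], dtype=np.float32),
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "I01")
    # Context managers make sure the file is closed before it is read again
    with open(path, "wb") as output_file:
        pkl.dump(output_dict, output_file, protocol=pkl.HIGHEST_PROTOCOL)
    with open(path, "rb") as input_file:
        restored = pkl.load(input_file)

print(sorted(restored))  # ['features', 'label']
```

Arrays keep their dtype through the round trip, so the `float32` labels come back as `float32`.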

### Preprocessing of the training datafiles

In [3]:
for i in range(1, 76):
    
    file_path = f"./data/Training/I{i:02}"
    print(file_path)
    output_file_name = f"./processed_data/Training/I{i:02}"
    try:
        # Reading the channels of interest
        signal, info = wfdb.rdsamp(file_path, channel_names = ["II", "V1"])

        # Separating the two signals so we can put them in one dimension
        signal_II = signal[:, 0]
        signal_V1 = signal[:, 1]


        # Reading the annotations
        annotations = wfdb.rdann(file_path, "atr")
        symbol_positions = annotations.sample
        symbol_list = annotations.symbol

        # Filtering out all the lines that do not have a QRS symbol

        qrs_symbs = ['N','L','R','B','A','a','J','S','V','r','F','e','j', 'n', 'E', '/', 'f', 'Q', '?']

        qrs_symbol_positions = [symbol_positions[idx] for idx, symb in enumerate(symbol_list) if symb in qrs_symbs]

        # Some peak positions are negative values, we are filtering these negative values out
        qrs_symbol_positions = [item for item in qrs_symbol_positions if item >= 0]
        
        target_vec = parabola(qrs_symbol_positions, len(signal), 3)

        output_dict = {
            "features": signal_II + signal_V1,
            "label": target_vec
        }

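        # Writing the result; the context manager closes the file even if dump fails
        with open(output_file_name, "wb") as output_file:
            pkl.dump(output_dict, output_file, protocol=pkl.HIGHEST_PROTOCOL)
    except Exception as error:
        print(f"Error on file {file_path}: {error}")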

./data/Training/I01
./data/Training/I02
./data/Training/I03
./data/Training/I04
./data/Training/I05
./data/Training/I06
./data/Training/I07
./data/Training/I08
./data/Training/I09
./data/Training/I10
./data/Training/I11
./data/Training/I12
./data/Training/I13
./data/Training/I14
./data/Training/I15
./data/Training/I16
./data/Training/I17
./data/Training/I18
./data/Training/I19
./data/Training/I20
./data/Training/I21
./data/Training/I22
./data/Training/I23
./data/Training/I24
./data/Training/I25
./data/Training/I26
./data/Training/I27
./data/Training/I28
./data/Training/I29
./data/Training/I30
./data/Training/I31
./data/Training/I32
./data/Training/I33
./data/Training/I34
./data/Training/I35
./data/Training/I36
./data/Training/I37
./data/Training/I38
./data/Training/I39
./data/Training/I40
./data/Training/I41
./data/Training/I42
./data/Training/I43
./data/Training/I44
./data/Training/I45
./data/Training/I46
./data/Training/I47
./data/Training/I48
./data/Training/I49
./data/Training/I50


Example of how to read a processed file:

In [5]:
with open("./processed_data/Training/I01", "rb") as input_file:
    data_dict = pkl.load(input_file)
data_dict["features"]

array([-1.94444444, -1.97385621, -1.95751634, ...,  4.44117647,
        4.41176471,  4.45751634])
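
After loading a processed file, it can be useful to confirm that features and labels line up. The sketch below uses a small synthetic dictionary with the same two keys rather than a real file, and `check_example` is a hypothetical helper that is not part of the notebook.

```python
import numpy as np

def check_example(data_dict):
    """Hypothetical helper: verify that features and label are aligned 1D arrays."""
    features = np.asarray(data_dict["features"])
    label = np.asarray(data_dict["label"])
    assert features.ndim == 1 and label.ndim == 1, "expected 1D arrays"
    assert features.shape == label.shape, "features and label differ in length"
    # Label values produced by parabola() lie in [0, 1]
    assert label.min() >= 0.0 and label.max() <= 1.0, "label outside [0, 1]"
    return len(features)

# Synthetic stand-in with the same keys as the pickled dictionaries
example = {
    "features": np.array([0.0, 0.1, 0.9, 0.2, 0.0]),
    "label": np.array([0.0, 0.9375, 1.0, 0.9375, 0.0], dtype=np.float32),
}
n_samples = check_example(example)
print(n_samples)  # 5
```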

### Preprocessing of the test datafiles

In [14]:
nofile_indexes = [110, 120, 204, 206, 211, 216, 218, 229]
nofile_indexes = nofile_indexes + list(range(125,200)) + list(range(224,228))

for i in range(100, 235):
    
    if i not in nofile_indexes:
        file_path = f"./data/Test/{i}"
        output_file_name = f"./processed_data/Test/{i}"
        try:
            # Reading all channels; the first two are used below
            signal, info = wfdb.rdsamp(file_path)
            # Separating the two signals so we can put them in one dimension
            signal_II = signal[:, 0]
            signal_V1 = signal[:, 1]


            # Reading the annotations
            annotations = wfdb.rdann(file_path, "atr")
            symbol_positions = annotations.sample
            symbol_list = annotations.symbol
            
            # Filtering out all the lines that do not have a QRS symbol

            qrs_symbs = ['N','L','R','B','A','a','J','S','V','r','F','e','j', 'n', 'E', '/', 'f', 'Q', '?']

            qrs_symbol_positions = [symbol_positions[idx] for idx, symb in enumerate(symbol_list) if symb in qrs_symbs]

            # Some peak positions are negative values, we are filtering these negative values out
            qrs_symbol_positions = [item for item in qrs_symbol_positions if item >= 0]

            target_vec = parabola(qrs_symbol_positions, len(signal), 3)

            output_dict = {
                "features": signal_II + signal_V1,
                "label": target_vec
            }

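            # Writing the result; the context manager closes the file even if dump fails
            with open(output_file_name, "wb") as output_file:
                pkl.dump(output_dict, output_file, protocol=pkl.HIGHEST_PROTOCOL)
        except Exception as error:
            #traceback.print_exc()
            print(f"Error on file {file_path}: {error}")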

./data/Test/110


IndexError: list index out of range
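
Instead of maintaining `nofile_indexes` by hand, one possible approach is to keep only the record numbers that have a header file on disk, assuming each record ships with a `.hea` file as WFDB records do. The helper `existing_records` is hypothetical, and the demonstration runs on a temporary directory with empty placeholder files rather than on the real data. This check only detects missing files; it does not catch records whose header exists but cannot be parsed.

```python
import os
import tempfile

def existing_records(directory, candidates):
    """Hypothetical helper: keep the record numbers that have a header file on disk."""
    return [i for i in candidates
            if os.path.exists(os.path.join(directory, f"{i}.hea"))]

# Demonstration on a temporary directory with empty placeholder header files
with tempfile.TemporaryDirectory() as tmp_dir:
    for i in (100, 101, 103):
        open(os.path.join(tmp_dir, f"{i}.hea"), "w").close()
    records = existing_records(tmp_dir, range(100, 105))

print(records)  # [100, 101, 103]
```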

In [20]:
wfdb.rdsamp("./data/Test/101")

(array([[-0.345, -0.16 ],
        [-0.345, -0.16 ],
        [-0.345, -0.16 ],
        ...,
        [-0.295, -0.11 ],
        [-0.29 , -0.11 ],
        [ 0.   ,  0.   ]]),
 {'fs': 360,
  'sig_len': 650000,
  'n_sig': 2,
  'base_date': None,
  'base_time': None,
  'units': ['mV', 'mV'],
  'sig_name': ['MLII', 'V1'],
  'comments': ['75 F 1011 654 x1', 'Diapres']})

In [25]:
wfdb.rdrecord("./data/Test/110").__dict__

IndexError: list index out of range
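
To find out which step raises an error like this one inside the preprocessing loop, the full traceback can be collected per record instead of printing only a short message. The sketch below is an illustration under assumptions: `process` is a hypothetical stand-in that fails on purpose with the same exception type, not the real record-reading code.

```python
import traceback

def process(record_name):
    """Hypothetical stand-in for the per-record processing; fails on purpose."""
    fields = []
    return fields[0]  # raises IndexError: list index out of range

errors = {}
for record_name in ["110"]:
    try:
        process(record_name)
    except Exception:
        # Store the formatted traceback so the failing line can be inspected later
        errors[record_name] = traceback.format_exc()

for record_name, details in errors.items():
    print(f"Error on record {record_name}:\n{details}")
```

Collecting the tracebacks in a dictionary keeps the loop running over the remaining records while preserving the details of each failure.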