HDF_dataset_adds_on is intended for adding feature values to the HDF files, in a way to eliminate further computations. Features are added in the form of dataframes. Each dataframe has the structure [rows: the readable ECG leads, columns: feature values over the ECG ROI's]

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import h5py
import neurokit2 as nk

In [42]:
normal_ecg_age = pd.read_pickle('normal_ecg_age.pickle')
normal_ecg_age

Unnamed: 0,ECG_ID,Age,Age_class_0,Age_class_1,Age_class_2,Age_class_3
0,A00002,32,2,1,0,0
1,A00003,63,5,2,1,1
2,A00006,46,3,1,1,0
3,A00008,32,2,1,0,0
4,A00009,48,3,1,1,0
...,...,...,...,...,...,...
13900,A25755,44,3,1,1,0
13901,A25756,76,6,3,2,1
13902,A25757,55,4,2,1,1
13903,A25764,20,1,0,0,0


Create h5py dataframe of R peaks: f['ECG_R_Peaks'][...]. Procedure also creates pickles files containing:
 - ecg_signal_read_error: pickle file containing ['ECG_ID', 'Derivation'] of signal read error OSError.
 - ecg_multiple_r_peaks_detection: pickle file containing ['ECG_ID', 'Derivation'] of signal with multiple R Peaks detection. (Un apagon interrumpio su calculo)
 - ecg_r_peaks_missing: pickle file containing ['ECG_ID', 'Derivation'] of signal with miss detection and -1 included because NaN is of type float.

Dataframe in the form rows: derivations, columns: ECG_R_Peaks fiducial points.

In [None]:
ecg_signal_read_error = pd.DataFrame(columns=['ECG_ID', 'Derivation'])
ecg_multiple_r_peaks_detection = pd.DataFrame(columns=['ECG_ID', 'Derivation'])
ecg_r_peaks_missing = pd.DataFrame(columns=['ECG_ID', 'Derivation'])
for id in normal_ecg_age['ECG_ID'][12344:]:
    with h5py.File(f'E:/1-DENIS/Biomarkers/SPH dataset/records/{id}.h5', 'r+') as f:
        for der in range (12):
            try:
                ecg_sig = f['ecg'][der]
            except OSError:
                print(f'ECG signal: {id}, derivation: {der}. Couldnot be read')
                ecg_signal_read_error.loc[len(ecg_signal_read_error)] = [id, der]
                continue
            ecg_fixed, is_inverted = nk.ecg_invert(ecg_sig, sampling_rate=500)
            if is_inverted:
                ecg_sig = ecg_fixed    
            signals, _ = nk.ecg_process(ecg_sig, sampling_rate=500)
            roi_ref = list(signals[signals['ECG_R_Peaks'] == 1].index)
            if der == 0:
                ECG_R_Peaks_dataframe = pd.DataFrame(columns=[c for c in range(len(roi_ref))])            
            else:
                while len(roi_ref) > len(ECG_R_Peaks_dataframe.columns):
                    interval_difference = [0] * (len(roi_ref) - 1)
                    for i in range(len(roi_ref) - 1):
                        interval_difference[i] = roi_ref[i + 1] - roi_ref[i]
                    index_min_interval = interval_difference.index(min(interval_difference)) + 1
                    ecg_multiple_r_peaks_detection.loc[len(ecg_multiple_r_peaks_detection)] = [id, der]
                    roi_ref.pop(index_min_interval)
                while len(roi_ref) < len(ECG_R_Peaks_dataframe.columns):
                    roi_ref.append(-1)
                    ecg_r_peaks_missing.loc[len(ecg_r_peaks_missing)] = [id, der]
            ECG_R_Peaks_dataframe.loc[len(ECG_R_Peaks_dataframe)] = roi_ref
        #ECG_R_Peaks_dataframe.index = [['I', 'II', 'III', 'aVR', 'aVL', 'aVF', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6']]
        f['ECG_R_Peaks'] = ECG_R_Peaks_dataframe
        del ECG_R_Peaks_dataframe
        f.close()
ecg_signal_read_error.to_pickle('ecg_signal_read_error.pickle')
ecg_multiple_r_peaks_detection.to_pickle('ecg_multiple_r_peaks_detection.pickle')
ecg_r_peaks_missing.to_pickle('ecg_r_peaks_missing.pickle')

Because blackout running was interrupted and pickle files were not correctly created. The folloeing procedure is to create pickle files related with ECG read error and missing R peaks. The statistics related with multiple R peaks detection needs to be calculated on running time and the process is tedious (+24 hours).

In [None]:
ecg_signal_read_error = pd.DataFrame(columns=['ECG_ID', 'Derivation'])
ecg_r_peaks_missing = pd.DataFrame(columns=['ECG_ID', 'Derivation'])
for id in normal_ecg_age['ECG_ID']:
    with h5py.File(f'E:/1-DENIS/Biomarkers/SPH dataset/records/{id}.h5', 'r+') as f:
        r_peaks = f['ECG_R_Peaks'][...]
        row, col = r_peaks.shape
        for peaks_list in r_peaks:
            for c in range(col):
                if peaks_list[c] == -1:
                    ecg_r_peaks_missing.loc[len(ecg_r_peaks_missing)] = [id, der]
        for der in range (12):
            try:
                ecg_sig = f['ecg'][der]
            except OSError:
                print(f'ECG signal: {id}, derivation: {der}. Couldnot be read')
                ecg_signal_read_error.loc[len(ecg_signal_read_error)] = [id, der]
                continue      
        f.close()
ecg_signal_read_error.to_pickle('ecg_signal_read_error.pickle')
ecg_r_peaks_missing.to_pickle('ecg_r_peaks_missing.pickle')

In [70]:
ecg_r_peaks_missing

Unnamed: 0,ECG_ID,Derivation
0,A00038,11
1,A00038,11
2,A00038,11
3,A00038,11
4,A00038,11
...,...,...
10803,A25751,11
10804,A25751,11
10805,A25751,11
10806,A25751,11


Following procedure creates h5py dataframe of Katz fractal dimension values obtained around detected ECG R Peaks (ROI = [-150 + ECG_R_Peaks: 150 + ECG_R_Peaks]). 

Dataframe in the form rows: derivations, columns: Katz fractal dimension values for each ECG ROI

In [None]:
count_wrong_read_signal = 0
count_missing_peaks = 0
for id in normal_ecg_age['ECG_ID']:
    with h5py.File(f'E:/1-DENIS/Biomarkers/SPH dataset/records/{id}.h5', 'r+') as f:
        r_peaks = f['ECG_R_Peaks'][...]
        row,col = r_peaks.shape
        Katz_DataFrame = pd.DataFrame(columns=[c for c in range(col)])
        Katz_list = [np.NaN]*col
        index = range(12)
        for index, r in zip(index, r_peaks):
            try:
                signal = f['ecg'][index]
            except OSError:
                count_wrong_read_signal+=1
                continue
            for c in range(col):
                if r[c] != -1:
                    Katz_list[c],_ = nk.fractal_katz(signal[r[c] - 150:r[c] + 150])
                else:
                    count_missing_peaks+=1
                    Katz_list[c] = Katz_list[c - 1]
            Katz_DataFrame.loc[len(Katz_DataFrame)] = Katz_list
            Katz_list = [np.NaN]*col
        mn = pd.Series(np.mean(Katz_DataFrame, axis=1))
        mn.name = col
        std = pd.Series(np.std(Katz_DataFrame, axis=1))
        std.name = col + 1
        Katz_DataFrame = pd.concat([Katz_DataFrame, mn, std], axis=1)
        f['Katz_fractal'] = Katz_DataFrame
        del Katz_DataFrame, r_peaks, mn, std
        f.close()
#print('Cannot read de signal:', count_wrong_read_signal) size of r_peaks don't let code to rest of derivations
print('Missing R peaks:', count_missing_peaks)

Find the derivations with minor Katz fractal dimension standard deviation

In [26]:
derivations_list = [] 
for id in normal_ecg_age['ECG_ID']:
    with h5py.File(f'E:/1-DENIS/Biomarkers/SPH dataset/records/{id}.h5', 'r+') as f:
        k = f['Katz_fractal'][:,-1]
        derivations_list.append(np.argmin(k))
np.save('derivations_list.npy', derivations_list)

In [27]:
unique, counts = np.unique(derivations_list, return_counts=True)
print(np.asarray((unique, counts)).T)

[[   0  569]
 [   1  876]
 [   2  303]
 [   3  711]
 [   4  118]
 [   5  518]
 [   6  802]
 [   7 2107]
 [   8 1572]
 [   9 2473]
 [  10 2420]
 [  11 1436]]


Derivation ['I', 'II', 'III', 'aVR', 'aVL', 'aVF', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6'] with minor standard deviation: 9 corresponding to V4, 10 corresponding to V5, 7 corresponding to V2

In [4]:
Katz_mean_457 = pd.DataFrame(columns=['ECG_ID', 'Katz_mean_V4', 'Katz_mean_V5', 'Katz_mean_V2'])
for id in normal_ecg_age['ECG_ID']:
    with h5py.File(f'E:/1-DENIS/Biomarkers/SPH dataset/records/{id}.h5', 'r+') as f:
        try:
            Katz_mean_457.loc[len(Katz_mean_457)] = [id, f['Katz_fractal'][9,-2], f['Katz_fractal'][10,-2], f['Katz_fractal'][7,-2]]
        except IndexError:
            continue
Katz_mean_457.to_pickle('Katz_mean_457.pickle')

In [5]:
Katz_mean_V457

Unnamed: 0,ECG_ID,Katz_mean_V4,Katz_mean_V5,Katz_mean_V2
0,A00002,1.256836,1.205078,1.245117
1,A00003,1.423828,1.388672,1.340820
2,A00006,1.310547,1.313477,1.375000
3,A00008,1.284180,1.262695,1.316406
4,A00009,1.292969,1.264648,1.320312
...,...,...,...,...
13862,A25755,1.243164,1.220703,1.407227
13863,A25756,1.233398,1.227539,1.317383
13864,A25757,1.362305,1.320312,1.367188
13865,A25764,1.291992,1.266602,1.439453


In [44]:
ecg_signal_read_error = pd.read_pickle('ecg_signal_read_error.pickle')
ecg_signal_read_error[ecg_signal_read_error['Derivation']==9].count()

ECG_ID        38
Derivation    38
dtype: int64

In [49]:
len(Katz_mean_V4) + ecg_signal_read_error[ecg_signal_read_error['Derivation']==9].count()

ECG_ID        13905
Derivation    13905
dtype: int64

In [10]:
Katz_mean_V457_age = pd.merge(Katz_mean_V457,normal_ecg_age, on=['ECG_ID', 'ECG_ID'])

In [11]:
Katz_mean_V457_age

Unnamed: 0,ECG_ID,Katz_mean_V4,Katz_mean_V5,Katz_mean_V2,Age,Age_class_0,Age_class_1,Age_class_2,Age_class_3
0,A00002,1.256836,1.205078,1.245117,32,2,1,0,0
1,A00003,1.423828,1.388672,1.340820,63,5,2,1,1
2,A00006,1.310547,1.313477,1.375000,46,3,1,1,0
3,A00008,1.284180,1.262695,1.316406,32,2,1,0,0
4,A00009,1.292969,1.264648,1.320312,48,3,1,1,0
...,...,...,...,...,...,...,...,...,...
13862,A25755,1.243164,1.220703,1.407227,44,3,1,1,0
13863,A25756,1.233398,1.227539,1.317383,76,6,3,2,1
13864,A25757,1.362305,1.320312,1.367188,55,4,2,1,1
13865,A25764,1.291992,1.266602,1.439453,20,1,0,0,0


In [12]:
Katz_mean_V457_age['Age_class_1'].unique()

array([1, 2, 0, 3, 4], dtype=int64)

In [14]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(Katz_mean_V457_age[['Katz_mean_V4', 'Katz_mean_V5', 'Katz_mean_V2']], Katz_mean_V457_age['Age_class_1'], random_state=0)

In [15]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))

Test set score: 0.42


In [16]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(tree.score(X_test, y_test)))

Accuracy on training set: 1.00
Accuracy on test set: 0.37


In [None]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)
print("Accuracy on training set: {:.2f}".format(tree.score(X_train, y_train)))
print("Accuracy on test set: {:.2f}".format(tree.score(X_test, y_test)))

Conform Katz_mean pickle file formed by the mean of the Katz fractal dimension values for the 12 leads.

In [33]:
Katz_mean = pd.DataFrame(columns=['ECG_ID','I_mean', 'II_mean', 'III_mean', 'aVR_mean', 'aVL_mean', 'aVF_mean', 'V1_mean', 'V2_mean', 'V3_mean', 'V4_mean', 'V5_mean', 'V6_mean'])
for id in normal_ecg_age['ECG_ID']:
    with h5py.File(f'E:/1-DENIS/Biomarkers/SPH dataset/records/{id}.h5', 'r+') as f:
        try:
            Katz_mean.loc[len(Katz_mean)] = [id, f['Katz_fractal'][0,-2], f['Katz_fractal'][1,-2], f['Katz_fractal'][2,-2], f['Katz_fractal'][3,-2], f['Katz_fractal'][4,-2], f['Katz_fractal'][5,-2], f['Katz_fractal'][6,-2], f['Katz_fractal'][7,-2], f['Katz_fractal'][8,-2], f['Katz_fractal'][9,-2], f['Katz_fractal'][10,-2], f['Katz_fractal'][11,-2]]
        except IndexError:
            continue
Katz_mean.to_pickle('Katz_mean.pickle')

In [34]:
Katz_mean

Unnamed: 0,ECG_ID,I_mean,II_mean,III_mean,aVR_mean,aVL_mean,aVF_mean,V1_mean,V2_mean,V3_mean,V4_mean,V5_mean,V6_mean
0,A00002,1.248047,1.194336,1.248047,1.210938,1.367188,1.192383,1.194336,1.245117,1.377930,1.256836,1.205078,1.176758
1,A00003,1.921875,1.493164,1.545898,1.659180,1.859375,1.464844,1.563477,1.340820,1.361328,1.423828,1.388672,1.398438
2,A00006,1.668945,1.635742,1.791992,1.575195,1.916992,1.703125,1.440430,1.375000,1.408203,1.310547,1.313477,1.321289
3,A00008,1.495117,1.263672,1.261719,1.290039,1.298828,1.256836,1.287109,1.316406,1.347656,1.284180,1.262695,1.236328
4,A00009,1.277344,1.322266,1.553711,1.292969,1.284180,1.416992,1.222656,1.320312,1.398438,1.292969,1.264648,1.253906
...,...,...,...,...,...,...,...,...,...,...,...,...,...
13862,A25755,1.231445,1.244141,1.421875,1.225586,1.266602,1.364258,1.343750,1.407227,1.283203,1.243164,1.220703,1.213867
13863,A25756,1.351562,1.333008,1.667969,1.324219,1.491211,1.391602,1.355469,1.317383,1.256836,1.233398,1.227539,1.234375
13864,A25757,1.501953,1.314453,1.443359,1.357422,1.698242,1.337891,1.388672,1.367188,1.367188,1.362305,1.320312,1.290039
13865,A25764,1.392578,1.285156,1.279297,1.320312,1.367188,1.271484,1.397461,1.439453,1.411133,1.291992,1.266602,1.251953


In [38]:
Katz_mean.shape[1]

13

Following procedure creates h5py dataframe of Higuchi fractal dimension values obtained around detected ECG R Peaks (ROI = [-150 + ECG_R_Peaks: 150 + ECG_R_Peaks]). 

Dataframe in the form rows: derivations, columns: Katz fractal dimension values for each ECG ROI

In [40]:
count_missing_peaks = 0
for id in normal_ecg_age['ECG_ID']:
    with h5py.File(f'E:/1-DENIS/Biomarkers/SPH dataset/records/{id}.h5', 'r+') as f:
        r_peaks = f['ECG_R_Peaks'][...]
        col = r_peaks.shape[1]
        Line_length_DataFrame = pd.DataFrame(columns=[c for c in range(col)])
        Line_length_list = [np.NaN]*col
        index = range(12)
        for index, r in zip(index, r_peaks):
            try:
                signal = f['ecg'][index]
            except OSError:
                continue
            for c in range(col):
                if r[c] != -1:
                    Line_length_list[c],_ = nk.fractal_linelength(signal[r[c] - 150:r[c] + 150])
                else:
                    count_missing_peaks+=1
                    Line_length_list[c] = Line_length_list[c - 1]
            Line_length_DataFrame.loc[len(Line_length_DataFrame)] = Line_length_list
            Line_length_list = [np.NaN]*col
        mn = pd.Series(np.mean(Line_length_DataFrame, axis=1))
        mn.name = col
        std = pd.Series(np.std(Line_length_DataFrame, axis=1))
        std.name = col + 1
        Line_length_DataFrame = pd.concat([Line_length_DataFrame, mn, std], axis=1)
        f['Line_length_fractal'] = Line_length_DataFrame
        del Line_length_DataFrame, r_peaks, mn, std
        f.close()
print('Missing R peaks:', count_missing_peaks)

Missing R peaks: 10808


Find the derivations with minor line length fractal dimension standard deviation

In [43]:
derivations_list_line_length = [] 
for id in normal_ecg_age['ECG_ID']:
    with h5py.File(f'E:/1-DENIS/Biomarkers/SPH dataset/records/{id}.h5', 'r+') as f:
        ll = f['Line_length_fractal'][:,-1]
        derivations_list_line_length.append(np.argmin(ll))
np.save('derivations_list_line_length.npy', derivations_list_line_length)

In [44]:
unique, counts = np.unique(derivations_list_line_length, return_counts=True)
print(np.asarray((unique, counts)).T)

[[   0 2741]
 [   1 2813]
 [   2 1168]
 [   3 3153]
 [   4 1324]
 [   5  640]
 [   6  726]
 [   7  387]
 [   8  308]
 [   9  100]
 [  10  169]
 [  11  376]]


Derivation ['I', 'II', 'III', 'aVR', 'aVL', 'aVF', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6'] with minor standard deviation: 3 corresponding to aVR, 1 corresponding to II, 0 corresponding to I

In [None]:
Line_length_mean_310 = pd.DataFrame(columns=['ECG_ID', 'Ll_mean_aVR', 'Ll_mean_II', 'Ll_mean_I'])
for id in normal_ecg_age['ECG_ID']:
    with h5py.File(f'E:/1-DENIS/Biomarkers/SPH dataset/records/{id}.h5', 'r+') as f:
        try:
            Line_length_mean_310.loc[len(Line_length_mean_310)] = [id, f['Line_length_fractal'][3,-2], f['Line_length_fractal'][1,-2], f['Line_length_fractal'][0,-2]]
        except IndexError:
            continue
Line_length_mean_310.to_pickle('Line_length_mean_310.pickle')

Conform Katz_mean pickle file formed by the mean of the line length fractal dimension values for the 12 leads.

In [45]:
Line_length_mean = pd.DataFrame(columns=['ECG_ID','I_mean', 'II_mean', 'III_mean', 'aVR_mean', 'aVL_mean', 'aVF_mean', 'V1_mean', 'V2_mean', 'V3_mean', 'V4_mean', 'V5_mean', 'V6_mean'])
for id in normal_ecg_age['ECG_ID']:
    with h5py.File(f'E:/1-DENIS/Biomarkers/SPH dataset/records/{id}.h5', 'r+') as f:
        try:
            Line_length_mean.loc[len(Line_length_mean)] = [id, f['Line_length_fractal'][0,-2], f['Line_length_fractal'][1,-2], f['Line_length_fractal'][2,-2], f['Line_length_fractal'][3,-2], f['Line_length_fractal'][4,-2], f['Line_length_fractal'][5,-2], f['Line_length_fractal'][6,-2], f['Line_length_fractal'][7,-2], f['Line_length_fractal'][8,-2], f['Line_length_fractal'][9,-2], f['Line_length_fractal'][10,-2], f['Line_length_fractal'][11,-2]]
        except IndexError:
            continue
Line_length_mean.to_pickle('Line_length_mean.pickle')