The following code selects, from the SPH 12-lead ECG dataset, the parameters required to develop algorithms for biological age estimation from ECG signals. Processing is done on the dataset's metadata.csv file. The features involved are:
metadata['ECG_ID']
metadata['AHA_Code'] == '1' (normal ECG)
metadata['Age'] (or derived features that encode age as intervals rather than exact values)

Import required packages

In [1]:
import numpy as np
import pandas as pd
import h5py

Read the SPH dataset metadata.csv file into a DataFrame

In [2]:
ecg_metadata = pd.read_csv('E:/1-DENIS/Biomarkers/SPH dataset/metadata.csv')

In [3]:
ecg_metadata.head()

Unnamed: 0,ECG_ID,AHA_Code,Patient_ID,Age,Sex,N,Date
0,A00001,22;23,S00001,55,M,5000,2020-03-04
1,A00002,1,S00002,32,M,6000,2019-09-03
2,A00003,1,S00003,63,M,6500,2020-07-16
3,A00004,23,S00004,31,M,5000,2020-07-14
4,A00005,146,S00005,47,M,5500,2020-01-07


Method: remove non-primary AHA_Code statements (for more information, refer to the article describing the SPH dataset)

In [4]:
def remove_nonprimary_code(x):
    """Remove non-primary statement"""
    r = []
    for cx in x:
        for c in cx.split('+'):
            if int(c) < 200 or int(c) >= 500:
                if c not in r:
                    r.append(c)
    return r

In [5]:
codes = ecg_metadata['AHA_Code'].str.split(';')
primary_codes = codes.apply(remove_nonprimary_code)
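
To illustrate the filter, the sketch below applies the same rule to a few made-up code strings (the sample values are hypothetical, not taken from the dataset). Judging from the condition in the function, codes from 200 to 499 are treated as non-primary modifiers attached with '+'; this reading is an assumption based on the code alone.

```python
import pandas as pd


def remove_nonprimary_code(x):
    """Keep codes below 200 or from 500 upward, without duplicates."""
    r = []
    for cx in x:
        for c in cx.split('+'):
            if (int(c) < 200 or int(c) >= 500) and c not in r:
                r.append(c)
    return r


# Hypothetical AHA_Code strings: ';' separates statements, '+' attaches modifiers
sample = pd.Series(['60+310;50', '1', '22;23'])
sample_primary = sample.str.split(';').apply(remove_nonprimary_code)
print(sample_primary.tolist())
```

In the first string the modifier 310 is dropped, while 60 and 50 are kept.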

Add the primary codes to the DataFrame as a new column (primary_codes)

In [6]:
ecg_metadata['primary_codes'] = primary_codes

In [7]:
ecg_metadata.head()

Unnamed: 0,ECG_ID,AHA_Code,Patient_ID,Age,Sex,N,Date,primary_codes
0,A00001,22;23,S00001,55,M,5000,2020-03-04,"[22, 23]"
1,A00002,1,S00002,32,M,6000,2019-09-03,[1]
2,A00003,1,S00003,63,M,6500,2020-07-16,[1]
3,A00004,23,S00004,31,M,5000,2020-07-14,[23]
4,A00005,146,S00005,47,M,5500,2020-01-07,[146]


Method: keep normal AHA_Code (AHA_Code == '1') and replace non-normal AHA_Code with np.nan

In [8]:
def get_code_1(x):
    """Return ['1'] if the only primary code is '1' (normal ECG), else NaN."""
    if x == ['1']:
        return ['1']
    # np.nan replaces the np.NaN alias, which was removed in NumPy 2.0
    return np.nan

In [9]:
normal_codes = ecg_metadata.primary_codes.apply(get_code_1)

In [10]:
normal_codes

0        NaN
1        [1]
2        [1]
3        NaN
4        NaN
        ... 
25765    NaN
25766    NaN
25767    NaN
25768    NaN
25769    NaN
Name: primary_codes, Length: 25770, dtype: object

Add to the DataFrame a new column with the normal codes (AHA_Code == '1') and NaN for the rest

In [11]:
ecg_metadata['normal_codes'] = normal_codes

In [12]:
ecg_metadata.head()

Unnamed: 0,ECG_ID,AHA_Code,Patient_ID,Age,Sex,N,Date,primary_codes,normal_codes
0,A00001,22;23,S00001,55,M,5000,2020-03-04,"[22, 23]",
1,A00002,1,S00002,32,M,6000,2019-09-03,[1],[1]
2,A00003,1,S00003,63,M,6500,2020-07-16,[1],[1]
3,A00004,23,S00004,31,M,5000,2020-07-14,[23],
4,A00005,146,S00005,47,M,5500,2020-01-07,[146],


In [13]:
# drop only the rows whose normal_codes entry is NaN (non-normal ECGs)
ecg_normal = ecg_metadata.dropna(axis=0, subset=['normal_codes'])
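
As an alternative to marking non-normal rows with NaN and dropping them, the selection can be sketched with a boolean mask. The frame below is a toy example with hypothetical rows and names (toy, is_normal, toy_normal); it only mimics the columns used here.

```python
import pandas as pd

# Toy frame with hypothetical rows, mimicking the columns used in this notebook
toy = pd.DataFrame({
    'ECG_ID': ['A1', 'A2', 'A3', 'A4'],
    'primary_codes': [['22', '23'], ['1'], ['1'], ['146']],
})

# True only where the record's sole primary code is '1' (normal ECG)
is_normal = toy['primary_codes'].apply(lambda codes: codes == ['1'])

# Keep the normal records and renumber the index from 0
toy_normal = toy[is_normal].reset_index(drop=True)
print(toy_normal)
```

A mask of this kind leaves other columns untouched, so missing values elsewhere in the table would not affect which records are kept.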

Verify that the number of records with AHA_Code == '1' matches the count reported for the SPH 12-lead ECG dataset (Category == 'A', Code == '1', Primary Statement == 'Normal ECG', Count == 13905)

In [14]:
ecg_normal.shape

(13905, 9)
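
A complementary cross-check, sketched here on hypothetical data, is to count how often each primary code occurs across all records. Note that this counts every record containing code '1', including records that also carry other codes, so it can differ from the strict selection used in this notebook.

```python
import pandas as pd

# Hypothetical primary code lists, one list per record
toy_codes = pd.Series(
    [['22', '23'], ['1'], ['1'], ['146'], ['1', '22']],
    name='primary_codes',
)

# One row per individual code, then count occurrences of each code
code_counts = toy_codes.explode().value_counts()
print(code_counts['1'])
```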

In [15]:
ecg_normal.reset_index(drop=True, inplace=True)

In [16]:
ecg_normal.head()

Unnamed: 0,ECG_ID,AHA_Code,Patient_ID,Age,Sex,N,Date,primary_codes,normal_codes
0,A00002,1,S00002,32,M,6000,2019-09-03,[1],[1]
1,A00003,1,S00003,63,M,6500,2020-07-16,[1],[1]
2,A00006,1,S00006,46,F,5000,2019-08-31,[1],[1]
3,A00008,1,S00008,32,M,5000,2019-10-02,[1],[1]
4,A00009,1,S00009,48,F,6000,2019-08-20,[1],[1]


In [17]:
ecg_normal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13905 entries, 0 to 13904
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   ECG_ID         13905 non-null  object
 1   AHA_Code       13905 non-null  object
 2   Patient_ID     13905 non-null  object
 3   Age            13905 non-null  int64 
 4   Sex            13905 non-null  object
 5   N              13905 non-null  int64 
 6   Date           13905 non-null  object
 7   primary_codes  13905 non-null  object
 8   normal_codes   13905 non-null  object
dtypes: int64(2), object(7)
memory usage: 977.8+ KB


Method: build a new column of age classes from intervals defined on the Age feature, following the age distribution described for the SPH ECG dataset. Classes are labeled 0 to 8 ([10, 20), [20, 30), ..., [90, 100))

In [18]:
def define_age_intervals(i):
    """Map an age to its decade class: [10, 20) -> 0, ..., [90, 100) -> 8."""
    if 10 <= i < 100:
        return int(i // 10) - 1
    # ages outside [10, 100) get no class
    # np.nan replaces the np.NaN alias, which was removed in NumPy 2.0
    return np.nan

In [19]:
age_intervals = ecg_normal['Age'].apply(define_age_intervals)

In [20]:
age_intervals.name = 'Age_class'

In [21]:
age_intervals

0        2
1        5
2        3
3        2
4        3
        ..
13900    3
13901    6
13902    4
13903    1
13904    4
Name: Age_class, Length: 13905, dtype: int64
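
The same binning can also be sketched with pd.cut, which takes the interval edges directly. The ages below are hypothetical and include two out-of-range values to show how they are handled.

```python
import pandas as pd

# Hypothetical ages; 9 and 100 fall outside [10, 100)
ages = pd.Series([32, 63, 46, 9, 100], name='Age')

# Edges 10, 20, ..., 100 with right=False give [10, 20), ..., [90, 100);
# labels=False returns the integer bin position (0 to 8) instead of interval labels
age_class = pd.cut(ages, bins=list(range(10, 101, 10)), right=False, labels=False)
age_class.name = 'Age_class'
print(age_class.tolist())
```

Out-of-range ages come back as NaN, so the result is stored as floats whenever such values are present.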

Build a DataFrame with the relevant columns

In [22]:
normal_ecg_age = pd.concat([ecg_normal['ECG_ID'], ecg_normal['Age'], age_intervals], axis=1)

In [23]:
normal_ecg_age.head()

Unnamed: 0,ECG_ID,Age,Age_class
0,A00002,32,2
1,A00003,63,5
2,A00006,46,3
3,A00008,32,2
4,A00009,48,3


Save normal_ecg_age dataframe in pickle format

In [24]:
# save normal_ecg_age in pickle format
normal_ecg_age.to_pickle('normal_ecg_age.pickle')
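
To confirm that the saved file can be loaded again, a round trip can be sketched as follows. The small frame and the temporary directory are stand-ins for illustration; in the notebook the file is written to the working directory.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for normal_ecg_age
toy = pd.DataFrame({
    'ECG_ID': ['A00002', 'A00003'],
    'Age': [32, 63],
    'Age_class': [2, 5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'normal_ecg_age.pickle')
    toy.to_pickle(path)               # write
    restored = pd.read_pickle(path)   # read back

# True when values, dtypes and index all match
print(restored.equals(toy))
```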