# ECGs in ml4c3

ECGs in the ml4c3 pipeline are stored as hd5 (hdf5) files. We access data stored within each hd5 using TensorMaps.

In [None]:
import h5py
from tensormap.TensorMap import update_tmaps

TensorMaps access data in hd5 files. Read more about this class [in the docs](https://github.com/aguirre-lab/ml4c3/wiki/TensorMaps). Data are organized in one hd5 per patient, named after the patient's ID number or medical record number (MRN). A single hd5 can contain multiple ECGs, which are located in the root-level container `/ecg`, e.g. `hd5['ecg']`. ECGs within the `/ecg` container are organized by the date-time at which the ECG was taken. In this example, patient "1234" has three ECGs from September 2020.

In [None]:
hd5 = h5py.File("1234.hd5", "r")
list(hd5['ecg'])
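
To make the layout concrete without access to patient data, the sketch below builds a tiny in-memory hd5 with the same `/ecg/<date-time>` structure and lists its keys. The date-time key format, the file name `mock_1234.hd5`, and the stored ages are made up for illustration; the key format in real ml4c3 files may differ.

```python
import h5py

# Hypothetical date-time keys; the real key format in ml4c3 files may differ.
ecg_times = ["2020-09-21T08:30:00", "2020-09-03T14:05:00", "2020-09-12T22:45:00"]

# driver="core" with backing_store=False keeps the file in memory only.
with h5py.File("mock_1234.hd5", "w", driver="core", backing_store=False) as mock_hd5:
    for i, ecg_time in enumerate(ecg_times):
        # Intermediate groups (/ecg and /ecg/<date-time>) are created automatically.
        mock_hd5.create_dataset(f"ecg/{ecg_time}/patientage", data=60 + i)

    # h5py iterates group members in name order by default, so sortable
    # date-time keys come back in chronological order.
    ecg_keys = list(mock_hd5["ecg"])

    # [()] reads a scalar dataset into memory.
    first_age = int(mock_hd5[f"ecg/{ecg_keys[0]}/patientage"][()])

print(ecg_keys)
print(first_age)
```

Listing the keys of the `ecg` group is the same operation as `list(hd5['ecg'])` above, applied to the mock file.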

ml4c3 comes with many predefined TensorMaps for multiple data modalities. The TensorMaps `ecg_2500` and `ecg_datetime` extract, respectively, the ECG waveform as 2500 samples per lead and the recording time of each ECG. We use the helper function `update_tmaps` to collect them in a dictionary of TensorMaps.

In [None]:
tmaps = {}
update_tmaps("ecg_2500", tmaps)
update_tmaps("ecg_datetime", tmaps)
voltage_tm = tmaps["ecg_2500"]
ecgdate_tm = tmaps["ecg_datetime"]

Each TensorMap encodes both the semantic meaning of the feature it extracts and the means of extracting it. Extraction is done by the TensorMap's `tensor_from_file` function, which takes the TensorMap itself and the open hd5 as arguments and returns a numpy array.

In [None]:
voltages = voltage_tm.tensor_from_file(voltage_tm, hd5)
ecgdates = ecgdate_tm.tensor_from_file(ecgdate_tm, hd5)
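
The calling convention `tm.tensor_from_file(tm, hd5)` can look redundant. A minimal sketch of one way it arises, assuming the extraction function is stored as a plain attribute rather than defined as a method: Python then does not bind it to the instance, so the TensorMap has to be passed explicitly. `MiniTensorMap` and `values_tensor_from_file` are hypothetical stand-ins, not the ml4c3 implementation, and a dict stands in for the open hd5.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

import numpy as np


@dataclass
class MiniTensorMap:
    """Hypothetical stand-in for TensorMap; not the ml4c3 class."""

    name: str
    shape: Tuple[int, ...]
    path_prefix: str
    # Stored as an instance attribute, so Python does not bind it as a method.
    tensor_from_file: Callable[["MiniTensorMap", Any], np.ndarray]


def values_tensor_from_file(tm: MiniTensorMap, data: Any) -> np.ndarray:
    # Uses the TensorMap's own attributes to decide where to read and what shape to return.
    values = data[tm.path_prefix]
    return np.asarray(values, dtype=float).reshape(tm.shape)


mini_tm = MiniTensorMap(
    name="mini",
    shape=(3,),
    path_prefix="ecg",
    tensor_from_file=values_tensor_from_file,
)

# A dict stands in for the open hd5 file.
mock_data = {"ecg": [1, 2, 3]}
tensor = mini_tm.tensor_from_file(mini_tm, mock_data)
print(tensor.shape)
```

Because the function receives the TensorMap, one extraction function can be reused by several TensorMaps that differ only in attributes such as `path_prefix` or `shape`.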

`ecg_2500` describes the extracted tensor as having shape `(2500, 12)`, 2500 samples across 12 leads. Because `tensor_from_file` of this TensorMap also extracts data from all ECGs in an hd5, the final shape of the tensor is `(3, 2500, 12)` for 3 ECGs.

In [None]:
print(
    f"The shape of each individual waveform described by the TensorMap is {voltage_tm.shape}.\n"
    f"The shape of the extracted tensor is {voltages.shape} because there are {len(ecgdates)} ECGs."
)
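
The relationship between the two shapes can be reproduced with plain numpy. This is only an illustration with placeholder zeros; whether ml4c3 builds the tensor with `np.stack` internally is an assumption.

```python
import numpy as np

n_ecgs, n_samples, n_leads = 3, 2500, 12

# Placeholder waveforms (zeros); real voltages would come from the hd5.
waveforms = [np.zeros((n_samples, n_leads)) for _ in range(n_ecgs)]

# np.stack adds a new leading axis that indexes the ECGs.
stacked = np.stack(waveforms, axis=0)
print(stacked.shape)

# Indexing the first axis recovers one ECG with the per-waveform shape.
single = stacked[0]
print(single.shape)
```

The leading axis indexes ECGs, so its length should match the number of recording times returned by `ecg_datetime`.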

Of course, the returned numpy arrays can also be inspected directly:

In [None]:
# voltages
ecgdates

This describes the basic use of TensorMaps. Below is an example of how a TensorMap might be implemented to return features from only the newest ECG for a given patient. Note the use of other attributes and methods encoded by the TensorMap.

In [None]:
import numpy as np
from tensormap.TensorMap import TensorMap, Interpretation, PatientData

def newest_patient_age_tensor_from_file(tm: TensorMap, hd5: PatientData) -> np.ndarray:
    ecg_dates = tm.time_series_filter(hd5)
    ecg_date = ecg_dates[-1]
    newest_age = hd5[f"{tm.path_prefix}/{ecg_date}/patientage"][()]
    newest_age = int(newest_age)
    tensor = np.array([newest_age])
    return tensor

newest_patient_age_tm = TensorMap(
    name="newest_ecg_patient_age",
    shape=(1,),
    interpretation=Interpretation.CONTINUOUS,
    tensor_from_file=newest_patient_age_tensor_from_file,
    path_prefix="ecg",
)

newest_patient_age_tm.tensor_from_file(newest_patient_age_tm, hd5)
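
The function above treats `ecg_dates[-1]` as the newest ECG, which presumably relies on `time_series_filter` returning the date-times in chronological order. The sketch below shows the same selection with an explicit sort, using only the standard library. The date-time keys and ages are hypothetical, and the ISO 8601 key format is an assumption.

```python
from datetime import datetime

# Hypothetical date-time keys, deliberately out of order.
ecg_keys = ["2020-09-21T08:30:00", "2020-09-03T14:05:00", "2020-09-12T22:45:00"]

# Hypothetical patient ages recorded with each ECG.
ages = {
    "2020-09-21T08:30:00": 62,
    "2020-09-03T14:05:00": 61,
    "2020-09-12T22:45:00": 61,
}

# Parse each key so the ordering is by time rather than by string comparison.
ordered = sorted(ecg_keys, key=datetime.fromisoformat)

# The last element of a chronologically sorted list is the newest ECG.
newest_key = ordered[-1]
newest_age = ages[newest_key]
print(newest_key, newest_age)
```

With ISO 8601 keys a plain string sort gives the same order, but parsing makes the intent explicit and also works for formats that do not sort lexicographically.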

This exact TensorMap already exists! `ecg_age_newest` returns the age of the patient from their most recent ECG.

Note that the `tensor_from_file` method of a `TensorMap` can encode any logic and execute arbitrary Python. One example is to binarize the patient's age at the newest ECG. Let's encode the binary feature as a one-hot vector, where the `channel_map` defines the index associated with each label. Given `channel_map={"lt_50": 0, "gte_50": 1}`, a vector of `[1, 0]` means the patient's age is less than 50 and `[0, 1]` means the patient's age is greater than or equal to 50.

In [None]:
def newest_binary_age_tensor_from_file(tm: TensorMap, hd5: PatientData) -> np.ndarray:
    ecg_dates = tm.time_series_filter(hd5)
    ecg_date = ecg_dates[-1]
    newest_age = hd5[f"{tm.path_prefix}/{ecg_date}/patientage"][()]
    newest_age = int(newest_age)

    ######
    if newest_age < 50:
        tensor = np.array([1, 0])
    else:
        tensor = np.array([0, 1])
    ######

    return tensor

newest_binary_age_tm = TensorMap(
    name="newest_ecg_binary_age_50",
    shape=(2,),
    channel_map={"lt_50": 0, "gte_50": 1},
    interpretation=Interpretation.CATEGORICAL,
    tensor_from_file=newest_binary_age_tensor_from_file,
    path_prefix="ecg",
)

newest_binary_age_tm.tensor_from_file(newest_binary_age_tm, hd5)
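
The hard-coded `[1, 0]` and `[0, 1]` arrays can also be derived from the `channel_map` itself, which keeps the encoding and the labels in one place. The helpers `one_hot` and `age_label` below are hypothetical names introduced for this sketch and are not part of ml4c3.

```python
import numpy as np

channel_map = {"lt_50": 0, "gte_50": 1}


def one_hot(label: str, channel_map: dict) -> np.ndarray:
    # One slot per label; set the slot at the label's index to 1.
    tensor = np.zeros(len(channel_map), dtype=np.float32)
    tensor[channel_map[label]] = 1.0
    return tensor


def age_label(age: int) -> str:
    # Same threshold as the binarized age example: under 50 vs. 50 and over.
    return "lt_50" if age < 50 else "gte_50"


young = one_hot(age_label(42), channel_map)
old = one_hot(age_label(50), channel_map)

# Invert the map to decode a one-hot vector back into its label.
index_to_label = {index: label for label, index in channel_map.items()}
decoded = index_to_label[int(np.argmax(old))]
print(young, old, decoded)
```

The same helper extends to more than two categories by adding entries to `channel_map` and a matching labelling rule.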

Below is the full list of ECG TensorMaps that serve as building blocks for more sophisticated functionality. `ml4c3` generates many variations of these base TensorMaps, e.g. to obtain tensors from only the `_newest` ECGs, to cross-reference against other data sources, or to weight the loss function.

In [None]:
from tensormap.ecg import tmaps
tmaps