# PhysioNet 2021 Challenge

The training data contain twelve-lead ECGs. The validation and test data contain twelve-lead, six-lead, four-lead, three-lead, and two-lead ECGs:

1. Twelve leads: I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, V6
2. Six leads: I, II, III, aVR, aVL, aVF
3. Four leads: I, II, III, V2
4. Three leads: I, II, V2
5. Two leads: I, II
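
The reduced-lead configurations above are all subsets of the twelve-lead set, so they can be derived from a twelve-lead recording by selecting rows. The following is a minimal sketch, assuming the signal is stored with one row per lead in the twelve-lead order listed above; the names `LEAD_SETS`, `lead_indices`, and `select_leads` are illustrative and not part of the Challenge code.

```python
# Lead order of the twelve-lead recordings, as listed above.
TWELVE_LEADS = ("I", "II", "III", "aVR", "aVL", "aVF",
                "V1", "V2", "V3", "V4", "V5", "V6")

# Reduced-lead configurations used in the validation and test data.
LEAD_SETS = {
    12: TWELVE_LEADS,
    6: ("I", "II", "III", "aVR", "aVL", "aVF"),
    4: ("I", "II", "III", "V2"),
    3: ("I", "II", "V2"),
    2: ("I", "II"),
}


def lead_indices(num_leads):
    """Return the row indices of a lead configuration within the twelve-lead order."""
    if num_leads not in LEAD_SETS:
        raise ValueError(f"unsupported lead count: {num_leads}")
    return [TWELVE_LEADS.index(name) for name in LEAD_SETS[num_leads]]


def select_leads(signal, num_leads):
    """Keep only the rows (leads) of a twelve-row signal that belong to the configuration.

    Assumption: `signal` has one row per lead, in TWELVE_LEADS order.
    """
    return [signal[i] for i in lead_indices(num_leads)]
```

For a NumPy array with shape `(12, num_samples)`, `signal[lead_indices(3)]` gives the same selection without the list comprehension.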

Each ECG recording has one or more labels that describe cardiac abnormalities (and/or a normal sinus rhythm).

The Challenge data include annotated twelve-lead ECG recordings from six sources in four countries across three continents. These databases include over 100,000 twelve-lead ECG recordings with over 88,000 ECGs shared publicly as training data.

For example, a header file A0001.hea may have the following contents:

```
    A0001 12 500 7500
    A0001.mat 16+24 1000/mV 16 0 28 -1716 0 I
    A0001.mat 16+24 1000/mV 16 0 7 2029 0 II
    A0001.mat 16+24 1000/mV 16 0 -21 3745 0 III
    A0001.mat 16+24 1000/mV 16 0 -17 3680 0 aVR
    A0001.mat 16+24 1000/mV 16 0 24 -2664 0 aVL
    A0001.mat 16+24 1000/mV 16 0 -7 -1499 0 aVF
    A0001.mat 16+24 1000/mV 16 0 -290 390 0 V1
    A0001.mat 16+24 1000/mV 16 0 -204 157 0 V2
    A0001.mat 16+24 1000/mV 16 0 -96 -2555 0 V3
    A0001.mat 16+24 1000/mV 16 0 -112 49 0 V4
    A0001.mat 16+24 1000/mV 16 0 -596 -321 0 V5
    A0001.mat 16+24 1000/mV 16 0 -16 -3112 0 V6
    #Age: 74
    #Sex: Male
    #Dx: 426783006
    #Rx: Unknown
    #Hx: Unknown
    #Sx: Unknown
```

Reading the header from top to bottom:
- The first line shows that the recording number is A0001, and that the recording has 12 leads, each sampled at 500 Hz, with 7500 samples per lead.
- The next 12 lines (one for each lead) show that each signal:
    - Is stored in the recording file A0001.mat in 16-bit format with an offset of 24 bytes.
    - Has a gain (analog-to-digital converter (ADC) units per physical unit) of 1000/mV.
    - Was digitized with an ADC resolution of 16 bits, and the ADC zero (used as the baseline value corresponding to 0 physical units) is 0.
    - Ends with the first value of the signal (28, etc.), the checksum (-1716, etc.), the block size (0), and the lead name (I, etc.).
- The final 6 lines show that the patient is:
    - A 74-year-old male
    - With a diagnosis (Dx) of 426783006, which is the **SNOMED-CT code** for sinus rhythm.
    - The medical prescription (Rx), history (Hx), and symptom or surgery (Sx) are unknown.

See the WFDB header format documentation for more information on the header file and its fields.
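
The header layout described above can be read with a few lines of plain Python. This is a minimal sketch, not the parser used by `PhysioNetDataset` (which is not shown here): the function names and dictionary keys are illustrative, the field order on each signal line follows the WFDB header specification (ADC resolution, ADC zero, initial value, checksum, block size, description), and the gain field is assumed to have the simple `gain/units` form seen in the example. The sample header is shortened to two leads for brevity.

```python
def parse_header(text):
    """Parse a WFDB-style header (e.g. the contents of A0001.hea) into a dict."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]

    # Record line: name, number of leads, sampling frequency, number of samples.
    name, num_leads, frequency, num_samples = lines[0].split()[:4]
    header = {
        "recording_number": name,
        "num_leads": int(num_leads),
        "sampling_frequency": float(frequency),
        "num_samples": int(num_samples),
        "leads": [],
        "comments": {},
    }

    # One signal line per lead.
    for line in lines[1:1 + header["num_leads"]]:
        (file_name, fmt, gain_units, resolution, adc_zero,
         initial_value, checksum, block_size, lead_name) = line.split()
        storage_format, _, byte_offset = fmt.partition("+")
        gain, _, units = gain_units.partition("/")
        header["leads"].append({
            "file": file_name,
            "format": int(storage_format),
            "byte_offset": int(byte_offset or 0),
            "adc_gain": float(gain),
            "units": units,
            "adc_resolution": int(resolution),
            "adc_zero": int(adc_zero),
            "initial_value": int(initial_value),
            "checksum": int(checksum),
            "block_size": int(block_size),
            "lead_name": lead_name,
        })

    # Comment lines such as "#Age: 74" become key/value pairs.
    for line in lines[1 + header["num_leads"]:]:
        if line.startswith("#") and ":" in line:
            key, _, value = line[1:].partition(":")
            header["comments"][key.strip()] = value.strip()
    return header


def to_physical(sample, lead):
    """Convert a raw ADC sample to physical units (mV in this example)."""
    return (sample - lead["adc_zero"]) / lead["adc_gain"]


# Example header, shortened to two leads.
example = """
A0001 2 500 7500
A0001.mat 16+24 1000/mV 16 0 28 -1716 0 I
A0001.mat 16+24 1000/mV 16 0 7 2029 0 II
#Age: 74
#Sex: Male
#Dx: 426783006
"""
header = parse_header(example)
first_lead = header["leads"][0]
first_value_mv = to_physical(first_lead["initial_value"], first_lead)
```

With a gain of 1000 ADC units per mV and an ADC zero of 0, a raw sample is converted to millivolts by dividing it by 1000.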

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
from tqdm import tqdm
from scipy.signal import resample
import torch
from transformers import AutoTokenizer, AutoModel
import ast


In [2]:
sys.path.append('C:/Users/navme/Desktop/ECG_Project/PyFiles')

In [3]:
from helper_functions import *

In [4]:
from dataset import PhysioNetDataset

In [5]:
PhysioNet_PATH = 'C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training'
PhysioNet_PATH

'C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training'

## Train/Val/Test PhysioNet Datasets 

In [6]:
# Train
train_set = PhysioNetDataset(PhysioNet_PATH, train=True)

# Val
val_set = PhysioNetDataset(PhysioNet_PATH, train=False)

In [24]:
# Example of ECG header data
train_set[0][0]

{'recording_number': 'JS00001',
 'recording_file': 'JS00001.mat',
 'num_leads': 12,
 'sampling_frequency': 500,
 'num_samples': 5000,
 'leads_info': [{'file': 'JS00001.mat',
   'adc_gain': 1000.0,
   'units': 'mV',
   'adc_resolution': 16,
   'adc_zero': 0,
   'initial_value': -254,
   'checksum': 21756,
   'lead_name': '0'},
  {'file': 'JS00001.mat',
   'adc_gain': 1000.0,
   'units': 'mV',
   'adc_resolution': 16,
   'adc_zero': 0,
   'initial_value': 264,
   'checksum': -599,
   'lead_name': '0'},
  {'file': 'JS00001.mat',
   'adc_gain': 1000.0,
   'units': 'mV',
   'adc_resolution': 16,
   'adc_zero': 0,
   'initial_value': 517,
   'checksum': -22376,
   'lead_name': '0'},
  {'file': 'JS00001.mat',
   'adc_gain': 1000.0,
   'units': 'mV',
   'adc_resolution': 16,
   'adc_zero': 0,
   'initial_value': -5,
   'checksum': 28232,
   'lead_name': '0'},
  {'file': 'JS00001.mat',
   'adc_gain': 1000.0,
   'units': 'mV',
   'adc_resolution': 16,
   'adc_zero': 0,
   'initial_value': -386,


In [8]:
# Example of ECG signal
train_set[0][1]

{'val': array([[408.24601882, 408.24601882, 408.24601882, ..., -83.34581329,
         -74.965045  , -63.10339951],
        [-92.07603073, -92.07603073, -92.07603073, ...,  57.20010276,
          54.51591647,  58.88514819],
        [225.08001192, 225.08001192, 225.08001192, ...,  93.39571052,
          97.44912853, 117.96825132]])}

## Loading Processed Patient Data

### PhysioNet 2021

In [9]:
processed_train_df_path = r'C:\Users\navme\Desktop\ECG_Project\Data\PhysioNet\processed_train_set_records.csv'
processed_val_df_path = r'C:\Users\navme\Desktop\ECG_Project\Data\PhysioNet\processed_val_set_records.csv'

# Normalize the Windows paths to use forward slashes
processed_train_df_path = convert_to_forward_slashes(processed_train_df_path)
processed_val_df_path = convert_to_forward_slashes(processed_val_df_path)
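
`convert_to_forward_slashes` comes from the notebook's `helper_functions` module, which is not shown here. As a rough stand-in with the same intent, the standard library can do the conversion; `to_forward_slashes` below is a hypothetical name, and its behavior is only assumed to match the real helper.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(path):
    """Return a Windows-style path rewritten with forward slashes."""
    # PureWindowsPath parses backslash-separated paths on any operating system.
    return PureWindowsPath(path).as_posix()


example_path = to_forward_slashes(r'C:\Users\navme\Desktop\ECG_Project\Data\CODE-15\exams.csv')
```

Because `PureWindowsPath` never touches the file system, this works the same way on Windows, Linux, and macOS.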

In [10]:
processed_train_df = pd.read_csv(processed_train_df_path)
processed_val_df = pd.read_csv(processed_val_df_path)

In [28]:
processed_train_df.head()

Unnamed: 0,recording_number,recording_file,num_leads,sampling_frequency,num_samples,age,sex,dx,rx,hx,...,lead_10_lead_name,lead_11_file,lead_11_adc_gain,lead_11_units,lead_11_adc_resolution,lead_11_adc_zero,lead_11_initial_value,lead_11_checksum,lead_11_lead_name,dx_modality
0,JS00001,JS00001.mat,12,500,5000,85.0,Male,"['164889003', '59118001', '164934002']",Unknown,Unknown,...,0,JS00001.mat,1000.0,mV,16,0,527,32579,0,"['atrial fibrillation', 'right bundle branch b..."
1,JS00002,JS00002.mat,12,500,5000,59.0,Female,"['426177001', '164934002']",Unknown,Unknown,...,0,JS00002.mat,1000.0,mV,16,0,0,31542,0,"['sinus bradycardia', 't wave abnormal']"
2,JS00004,JS00004.mat,12,500,5000,66.0,Male,['426177001'],Unknown,Unknown,...,0,JS00004.mat,1000.0,mV,16,0,342,24967,0,['sinus bradycardia']
3,JS00005,JS00005.mat,12,500,5000,73.0,Female,"['164890007', '429622005', '428750005']",Unknown,Unknown,...,0,JS00005.mat,1000.0,mV,16,0,-176,-8125,0,"['atrial flutter', 'st depression', 'nonspecif..."
4,JS00006,JS00006.mat,12,500,5000,46.0,Female,['426177001'],Unknown,Unknown,...,0,JS00006.mat,1000.0,mV,16,0,-98,-11358,0,['sinus bradycardia']


### CODE-15

In [12]:
CODE15_df_path = r'C:\Users\navme\Desktop\ECG_Project\Data\CODE-15\exams.csv'

# Normalize the Windows path to use forward slashes
CODE15_df_path = convert_to_forward_slashes(CODE15_df_path)

In [13]:
CODE15_df = pd.read_csv(CODE15_df_path)

In [14]:
CODE15_df.head(5)

Unnamed: 0,exam_id,age,is_male,nn_predicted_age,1dAVb,RBBB,LBBB,SB,ST,AF,patient_id,death,timey,normal_ecg,trace_file
0,1169160,38,True,40.160484,False,False,False,False,False,False,523632,False,2.098628,True,exams_part13.hdf5
1,2873686,73,True,67.05944,False,False,False,False,False,False,1724173,False,6.657529,False,exams_part13.hdf5
2,168405,67,True,79.62174,False,False,False,False,False,True,51421,False,4.282188,False,exams_part13.hdf5
3,271011,41,True,69.75026,False,False,False,False,False,False,1737282,False,4.038353,True,exams_part13.hdf5
4,384368,73,True,78.87346,False,False,False,False,False,False,331652,False,3.786298,False,exams_part13.hdf5
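
Unlike the PhysioNet CSV, CODE-15 stores its diagnoses as one boolean column per condition rather than as text. One possible way to obtain a text description comparable to `dx_modality` is sketched below. The spelled-out names are the usual expansions of the column abbreviations, but the exact wording, the dictionary `CODE15_LABELS`, and the function `code15_row_to_text` are assumptions made for illustration and do not come from the notebook.

```python
# Assumed expansions of the CODE-15 diagnosis columns.
CODE15_LABELS = {
    "1dAVb": "first degree atrioventricular block",
    "RBBB": "right bundle branch block",
    "LBBB": "left bundle branch block",
    "SB": "sinus bradycardia",
    "ST": "sinus tachycardia",
    "AF": "atrial fibrillation",
}


def code15_row_to_text(row, sep=", "):
    """Join the names of all diagnoses flagged True in one exam row.

    `row` can be a dict or a pandas Series (e.g. CODE15_df.iloc[i]);
    both support row[column] access. Returns "" when no flag is set.
    """
    names = [text for column, text in CODE15_LABELS.items() if bool(row[column])]
    return sep.join(names)


# Third row of the table above (exam_id 168405), written as a plain dict.
example_row = {"1dAVb": False, "RBBB": False, "LBBB": False,
               "SB": False, "ST": False, "AF": True}
example_text = code15_row_to_text(example_row)
```

Rows with no flagged condition yield an empty string, so the caller still has to decide how to describe them, for example by consulting the `normal_ecg` column.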


## TextEncoder()

The `TextEncoder` class converts the text description of each recording's diagnoses (`dx_modality`) into an embedding using the ClinicalBERT model.

- The input is a single string per ECG signal, formed by joining its diagnoses (`dx_modality`) with commas or spaces.
- Inputs come from the processed CSV files, using either `dx_modality` alone or `dx_modality` together with age, sex, etc.
- The ClinicalBERT weights stay frozen, since the model is already pretrained.
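
The first point can be illustrated without loading the model. In the processed CSV, each `dx_modality` cell holds the string form of a Python list, so it has to be parsed before the names are joined. The helper name `diagnoses_to_text` is illustrative only.

```python
import ast


def diagnoses_to_text(dx_modality, sep=", "):
    """Join the diagnosis names of one recording into a single string."""
    # CSV cells hold the string form of a list,
    # e.g. "['sinus bradycardia', 't wave abnormal']".
    if isinstance(dx_modality, str):
        dx_modality = ast.literal_eval(dx_modality)
    return sep.join(str(name) for name in dx_modality)


comma_text = diagnoses_to_text("['sinus bradycardia', 't wave abnormal']")
space_text = diagnoses_to_text(["sinus bradycardia"], sep=" ")
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for parsing values read from a file.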

## PhysioNet Data

### Case 1: dx_modality only

In [15]:
class TextEncoder:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
        self.model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

    def encode(self, text_list):
        # Check if text_list is a string representation of a list
        if isinstance(text_list, str):
            text_list = ast.literal_eval(text_list)
        # Convert list of strings to a single string
        text = ', '.join(text_list)
        # Tokenize text
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
        # Get embeddings from ClinicalBERT model
        with torch.no_grad():
            embeddings = self.model(**inputs).last_hidden_state
        # Average the embeddings to get single vector per each input
        embeddings = torch.mean(embeddings, dim=1)
        return embeddings

In [18]:
if isinstance(processed_train_df['dx_modality'][4], str):
    print('yes')
else:
    print('no')

yes


In [16]:
# Example of TextEncoder
encoder = TextEncoder()
embeddings = encoder.encode(processed_train_df['dx_modality'][0])

In [25]:
# Check size of the embeddings
print(embeddings.size())

torch.Size([1, 768])


### Case 2: dx_modality plus age, sex, etc

In [35]:
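class TextEncoder:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
        self.model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

    def encode(self, series):
        # dx_modality is stored in the CSV as the string form of a list, so parse it first
        dx = series['dx_modality']
        if isinstance(dx, str):
            dx = ast.literal_eval(dx)
        # Combine age, sex, and the diagnoses into a single string
        text = f"{series['age']}, {series['sex']}, {', '.join(dx)}"
        # Tokenize text
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
        # Get embeddings from the frozen ClinicalBERT model
        with torch.no_grad():
            embeddings = self.model(**inputs).last_hidden_state
        # Average the token embeddings to get a single vector per input
        embeddings = torch.mean(embeddings, dim=1)
        return embeddings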

In [37]:
encoder = TextEncoder()
embeddings = encoder.encode(processed_train_df.iloc[0])

In [38]:
print(embeddings.size())

torch.Size([1, 768])


### CODE-15 Data

## ECGEncoder() 

- The input is an ECG signal, and the output is an embedding of that signal.
- The model will be defined in `model.py`.
- Unlike the text encoder, its weights are updated during training.
- Optimizer: `optimizer = torch.optim.Adam(clip_model.ECGEncoder.parameters())`