# Main Thesis Topic: “Zero-shot classification of ECG signals using a CLIP-like model”

**For example: Train on PTB-XL:**

- Text Encoder: ClinicalBERT (applied to the diagnoses of each ECG signal to obtain the corresponding text embeddings)
- Image Encoder: 1D-CNN (encodes the ECG signal into signal embeddings; it takes the place of the image encoder in CLIP)

- Experiment A): Baseline: Use only the class name as the text, for example “Myocardial Infarction”. Some classes are excluded from training; once training is completed, the CLIP-like model is tested on these excluded classes.
    - Next, we get the text embeddings from ClinicalBERT and train the ECG encoder with a contrastive loss.

- Experiment B): Same as Experiment A but instead of testing on the same dataset/classes, we would test on other datasets containing different classes.
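
The zero-shot test in Experiments A and B can be sketched as follows. This is a minimal illustration, assuming the ECG and text embeddings live in the same 768-dimensional space; the function name `zero_shot_scores` is hypothetical, and random tensors stand in for the real encoder outputs.

```python
import torch
import torch.nn.functional as F

def zero_shot_scores(ecg_emb, class_text_emb):
    """Cosine similarity between each ECG embedding and each class-name embedding.

    ecg_emb:        (batch, dim) output of the ECG encoder
    class_text_emb: (num_classes, dim) text embeddings of the held-out class names
    Returns a (batch, num_classes) score matrix.
    """
    ecg_emb = F.normalize(ecg_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    return ecg_emb @ class_text_emb.T

torch.manual_seed(0)
ecg_emb = torch.randn(4, 768)          # stand-in for ECG encoder outputs
class_text_emb = torch.randn(3, 768)   # stand-in for ClinicalBERT class embeddings
scores = zero_shot_scores(ecg_emb, class_text_emb)
predicted = scores.argmax(dim=1)       # most similar class per recording
```

Because an ECG can carry several diagnoses, the per-class scores can also be passed directly to ranking metrics instead of taking the `argmax`.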

**Evaluation metrics:** 
- Main: AUC-ROC, average_precision_score
- Optional: Specificity, Sensitivity, F1-score 
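
As a rough sketch of how the two main metrics could be computed with scikit-learn for a multi-label problem: the label and score arrays below are toy values invented for illustration, not results from this project.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy multi-label ground truth: 4 recordings, 3 classes (hypothetical values).
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
# Toy scores, e.g. similarities between ECG and class-text embeddings.
y_score = np.array([[0.9, 0.2, 0.1],
                    [0.3, 0.8, 0.2],
                    [0.7, 0.6, 0.1],
                    [0.1, 0.3, 0.9]])

# Macro averaging computes the metric per class and then averages over classes.
macro_auc = roc_auc_score(y_true, y_score, average="macro")
macro_ap = average_precision_score(y_true, y_score, average="macro")
```

In this toy example every positive is scored above every negative within each class, so both metrics equal 1.0. Note that `roc_auc_score` needs at least one positive and one negative example per class.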

**Outcome:** 
- It’s possible to train CLIP-like models with a frozen text encoder (i.e. one that is left unchanged and not fine-tuned for downstream tasks).
- Training ECG encoders that are viable for representing different domains (within the ECG modality) and previously unseen classes.
- Training a CLIP-like model on ECGs has little novelty.

# Multi-Class ECG Classifier

First, we preprocess the ECG data from the PhysioNet 2021 challenge dataset. This data will be loaded using the ```PhysioNetDataset``` class.

In [23]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
from tqdm import tqdm
from scipy.signal import resample
import torch
from transformers import AutoTokenizer, AutoModel
import ast
import scipy.io as sio
from torch.utils.data import random_split

In [2]:
sys.path.append('C:/Users/navme/Desktop/ECG_Project/PyFiles')

In [3]:
from helper_functions import *
from dataset import *

In [4]:
# Path to dir/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training
PhysioNet_PATH = 'C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training'
PhysioNet_PATH

'C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training'

Using the ```PhysioNet_PATH```, we can create separate datasets for training, testing & validation.

# Data Preprocessing 

- train_set (train & validation data)
- test_set (test data)

In [8]:
train_set = PhysioNetDataset(PhysioNet_PATH, train=True)
test_set = PhysioNetDataset(PhysioNet_PATH, train=False)

len(train_set), len(test_set)

(65900, 22352)

The ```train_set``` can be split into ```current_train``` and ```current_val```. 

In [31]:
# Set the seed for the random number generator
torch.manual_seed(0)

# Get the length of the train_set
length = len(train_set)

# Calculate the lengths of the splits
train_length = int(0.85 * length)
val_length = length - train_length

# Split the dataset
current_train, current_val = random_split(train_set, [train_length, val_length])
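
A small sketch of what `random_split` returns, using a toy `TensorDataset` in place of `PhysioNetDataset` (the toy dataset and the names `toy_train` and `toy_val` are assumptions for illustration). Each split is a `Subset` holding shuffled indices into the parent dataset, so items must be read through the split itself to respect the split.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy dataset whose i-th item is simply the number i.
toy = TensorDataset(torch.arange(10))
gen = torch.Generator().manual_seed(0)
toy_train, toy_val = random_split(toy, [8, 2], generator=gen)

# Each split is a Subset that stores shuffled indices into the parent dataset.
print(len(toy_train), len(toy_val))
print(toy_train.indices)

# Indexing the split goes through those indices, not through positions 0..n-1
# of the parent dataset.
first_item = toy_train[0][0].item()
parent_index = toy_train.indices[0]
```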

The next step is to extract the header data for ```current_train```, ```current_val```, and ```test_set``` and save the data to a csv file. 

## current_train

In [32]:
# Initialize an empty list to store the records
records = []

# Iterate over all records
for i in tqdm(range(len(current_train)), desc="Processing records"):
    record, _ = current_train[i]  # Get the record from the training split (ignore the ECG data for now)
    
    # Flatten the 'leads_info' list into separate columns for each lead
    for j, lead_info in enumerate(record['leads_info']):
        for key, value in lead_info.items():
            record[f'lead_{j}_{key}'] = value
    del record['leads_info']  # We don't need the 'leads_info' list anymore

    # Append the record to the list
    records.append(record)

# Convert the list of records into a DataFrame
df = pd.DataFrame(records)

# Save the DataFrame to a CSV file
df.to_csv('train_set_records.csv', index=False)

print(f"Processed {len(records)} records.")

Processing records: 100%|██████████| 56015/56015 [14:35<00:00, 63.96it/s] 


Processed 56015 records.


## current_val

In [33]:
# Initialize an empty list to store the records
records = []

# Iterate over all records
for i in tqdm(range(len(current_val)), desc="Processing records"):
    record, _ = current_val[i]  # Get the record from the validation split (ignore the ECG data for now)
    
    # Flatten the 'leads_info' list into separate columns for each lead
    for j, lead_info in enumerate(record['leads_info']):
        for key, value in lead_info.items():
            record[f'lead_{j}_{key}'] = value
    del record['leads_info']  # We don't need the 'leads_info' list anymore

    # Append the record to the list
    records.append(record)

# Convert the list of records into a DataFrame
df = pd.DataFrame(records)

# Save the DataFrame to a CSV file
df.to_csv('val_set_records.csv', index=False)

print(f"Processed {len(records)} records.")

Processing records: 100%|██████████| 9885/9885 [00:15<00:00, 634.81it/s]


Processed 9885 records.


## test_set

In [34]:
# Initialize an empty list to store the records
records = []

# Iterate over all records
for i in tqdm(range(len(test_set)), desc="Processing records"):
    record, _ = test_set[i]  # Get the record from the test set (ignore the ECG data for now)
    
    # Flatten the 'leads_info' list into separate columns for each lead
    for j, lead_info in enumerate(record['leads_info']):
        for key, value in lead_info.items():
            record[f'lead_{j}_{key}'] = value
    del record['leads_info']  # We don't need the 'leads_info' list anymore

    # Append the record to the list
    records.append(record)

# Convert the list of records into a DataFrame
df = pd.DataFrame(records)

# Save the DataFrame to a CSV file
df.to_csv('test_set_records.csv', index=False)

print(f"Processed {len(records)} records.")

Processing records: 100%|██████████| 22352/22352 [00:43<00:00, 510.54it/s]


Processed 22352 records.
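
The three cells above repeat the same flattening logic. A minimal sketch of a reusable helper, assuming each record is a dict with a `leads_info` list of dicts as in the loops above; the helper name `flatten_record` and the fields of the toy record are hypothetical.

```python
def flatten_record(record):
    """Return a copy of `record` with 'leads_info' expanded into flat columns."""
    # Copy everything except the nested list, leaving the input untouched.
    flat = {k: v for k, v in record.items() if k != 'leads_info'}
    for j, lead_info in enumerate(record.get('leads_info', [])):
        for key, value in lead_info.items():
            flat[f'lead_{j}_{key}'] = value
    return flat

# Toy record; the real field names depend on PhysioNetDataset.
toy_record = {
    'dx': ['164889003'],
    'leads_info': [{'lead_name': 'I', 'gain': 1000},
                   {'lead_name': 'II', 'gain': 1000}],
}
flat = flatten_record(toy_record)
```

Returning a new dict rather than deleting keys in place avoids side effects if the dataset ever returns the same record object twice.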


Now that the header data has been extracted and saved to CSV files, we can also map the corresponding SNOMED CT codes to diagnosis names in those files.

First, let's load the SNOMED CT mappings:

In [37]:
smowmed_mappings_path = r'C:\Users\navme\Desktop\ECG_Project\Data\SNOWMED-CT Codes\combined_mappings.csv'
smowmed_mappings_path = convert_to_forward_slashes(smowmed_mappings_path)

# Load the SNOMED-CT mappings
smowmed_mappings = pd.read_csv(smowmed_mappings_path)
smowmed_mappings.head(2)

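| Unnamed: 0 | Dx | SNOMEDCTCode | Abbreviation | CPSC | CPSC_Extra | StPetersburg | PTB | PTB_XL | Georgia | Chapman_Shaoxing | Ningbo | Total | Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | atrial fibrillation | 164889003 | AF | 1221 | 153 | 2 | 15 | 1514 | 570 | 1780 | 0 | 5255 | |
| 1 | atrial flutter | 164890007 | AFL | 0 | 54 | 0 | 1 | 73 | 186 | 445 | 7615 | 8374 | |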


In [38]:
# Select the 'Dx' and 'SNOMEDCTCode' columns and use 'SNOMEDCTCode' as the index
# (chaining avoids modifying a slice of the original DataFrame in place)
codes = smowmed_mappings[['Dx', 'SNOMEDCTCode']].set_index('SNOMEDCTCode')

# Convert the DataFrame into a dictionary: {SNOMED CT code: diagnosis name}
codes_dict = codes['Dx'].to_dict()

Now, let's load the csv files and map the corresponding codes from ```codes_dict``` to the csv files:

In [46]:
train_set_path = convert_to_forward_slashes(r'C:\Users\navme\Desktop\ECG_Project\Data\PhysioNet\train_set_records.csv')
val_set_path = convert_to_forward_slashes(r'C:\Users\navme\Desktop\ECG_Project\Data\PhysioNet\val_set_records.csv')
test_set_path = convert_to_forward_slashes(r'C:\Users\navme\Desktop\ECG_Project\Data\PhysioNet\test_set_records.csv')

In [47]:
train_set_df = load_and_process(train_set_path)
val_set_df = load_and_process(val_set_path)
test_set_df = load_and_process(test_set_path)

Now, using the ```map_codes_to_dx()``` function, let's map the SNOMED CT codes in each ECG signal's ```dx``` field to diagnosis names. The new column containing the diagnosis names will be ```dx_modality```.

In [50]:
def map_codes_to_dx(codes):
    return [codes_dict.get(int(code), code) for code in codes]
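
To illustrate the lookup with its pass-through fallback, here is a small self-contained sketch. The two dictionary entries are taken from the mapping table shown above; the function name `map_codes`, the explicit `mapping` argument, and the unknown code `999999` are assumptions for illustration.

```python
# Two entries taken from the mapping table shown above.
toy_codes_dict = {164889003: 'atrial fibrillation', 164890007: 'atrial flutter'}

def map_codes(codes, mapping):
    """Translate codes to names; codes missing from `mapping` are passed through unchanged."""
    return [mapping.get(int(code), code) for code in codes]

mapped = map_codes(['164889003', '164890007', '999999'], toy_codes_dict)
print(mapped)  # ['atrial fibrillation', 'atrial flutter', '999999']
```

Note that `int(code)` raises a `ValueError` for non-numeric strings, so the codes are assumed to be numeric.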

In [51]:
train_set_df['dx_modality'] = train_set_df['dx'].apply(map_codes_to_dx)

In [52]:
val_set_df['dx_modality'] = val_set_df['dx'].apply(map_codes_to_dx)

In [53]:
test_set_df['dx_modality'] = test_set_df['dx'].apply(map_codes_to_dx)

In [56]:
test_set_df['dx_modality'][0]

['atrial fibrillation', 'right bundle branch block', 't wave abnormal']

Now, let's save the updated DataFrames to new CSV files.

In [57]:
train_set_df.to_csv('processed_train_set_records.csv', index=False)

In [58]:
val_set_df.to_csv('processed_val_set_records.csv', index=False)

In [59]:
test_set_df.to_csv('processed_test_set_records.csv', index=False)

Now that data preprocessing is complete, we can proceed to build our DL pipeline.

# DL Pipeline

In [62]:
processed_train_set_path = convert_to_forward_slashes(r'C:\Users\navme\Desktop\ECG_Project\Data\PhysioNet\processed_train_set_records.csv')
processed_val_set_path = convert_to_forward_slashes(r'C:\Users\navme\Desktop\ECG_Project\Data\PhysioNet\processed_val_set_records.csv')
processed_test_set_path = convert_to_forward_slashes(r'C:\Users\navme\Desktop\ECG_Project\Data\PhysioNet\processed_test_set_records.csv')

In [63]:
processed_train_df = pd.read_csv(processed_train_set_path)
processed_val_df = pd.read_csv(processed_val_set_path)
processed_test_df = pd.read_csv(processed_test_set_path)

The first step in the model's pipeline is to create: 

## TextEncoder()

Create a class, ```TextEncoder()```, that converts the description of the diagnosis class (```dx_modality```) into embeddings using the ClinicalBERT model.

- The input should be a single string of the diagnoses (dx_modality) for one ECG signal, joined with commas or blank spaces.
- Use the processed CSV files (either dx_modality alone, or dx_modality together with age, etc.)
- The weights are frozen (since the model is already pretrained)

In [60]:
class TextEncoder:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
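        self.model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
        # Keep the pretrained text encoder frozen: inference mode, no gradient updates
        self.model.eval()
        for param in self.model.parameters():
            param.requires_grad = False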

    def encode(self, text_list):
        # Check if text_list is a string representation of a list
        if isinstance(text_list, str):
            text_list = ast.literal_eval(text_list)
        # Convert list of strings to a single string
        text = ', '.join(text_list)
        # Tokenize text
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
        # Get embeddings from ClinicalBERT model
        with torch.no_grad():
            embeddings = self.model(**inputs).last_hidden_state
        # Average the embeddings to get single vector per each input
        embeddings = torch.mean(embeddings, dim=1)
        return embeddings
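
If several texts were encoded in one padded batch, a plain mean over the token axis would also average the padding positions. A minimal sketch of masked mean pooling, assuming the usual `attention_mask` convention (1 for real tokens, 0 for padding); the function name `masked_mean` is hypothetical, and random tensors stand in for ClinicalBERT outputs.

```python
import torch

def masked_mean(last_hidden_state, attention_mask):
    """Average token embeddings over real tokens only.

    last_hidden_state: (batch, seq_len, hidden)
    attention_mask:    (batch, seq_len), 1 = token, 0 = padding
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts

torch.manual_seed(0)
hidden = torch.randn(2, 5, 768)            # stand-in for last_hidden_state
mask = torch.tensor([[1, 1, 1, 0, 0],      # first text has 3 real tokens
                     [1, 1, 1, 1, 1]])     # second text has 5 real tokens
pooled = masked_mean(hidden, mask)
```

For a single unpadded text, as in `encode()` above, this gives the same result as `torch.mean(embeddings, dim=1)`.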

In [65]:
if isinstance(processed_train_df['dx_modality'][4], str):
    print('yes')
else:
    print('no')

yes
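
The check prints `yes` because a list column written to CSV is read back as its string representation. A small sketch of the round trip, using an in-memory buffer instead of a file on disk (the one-row toy DataFrame is an assumption for illustration):

```python
import ast
import io
import pandas as pd

df = pd.DataFrame({'dx_modality': [['atrial fibrillation', 't wave abnormal']]})

# Round-trip through CSV in memory instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

raw = reloaded['dx_modality'][0]   # a str that looks like a Python list
parsed = ast.literal_eval(raw)     # back to a real Python list
```

This is why `TextEncoder.encode()` calls `ast.literal_eval` when it receives a string.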


In [66]:
# Example of TextEncoder
encoder = TextEncoder()
embeddings = encoder.encode(processed_train_df['dx_modality'][0])

# Check size of the embeddings
print(embeddings.size())

torch.Size([1, 768])


The next steps in the pipeline are to create: 

## 1. 1D-CNN Model

This 1D-CNN will be used as the backbone of the ```ECGEncoder()```

## 2. ECGEncoder() 

- Input is an ECG signal; output is the embedding of that signal
- This will be the model in model.py
- Model weights are updated iteratively during training
- optimizer = torch.optim.Adam(clip_model.ECGEncoder.parameters())
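
One training step with a CLIP-style contrastive loss could look like the sketch below. Everything here is an assumption for illustration: the function name `clip_loss`, the temperature of 0.07, the tiny stand-in encoder, the 12-lead input shape, and the random tensors used in place of real signals and frozen ClinicalBERT embeddings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def clip_loss(ecg_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the ECG-text similarity matrix."""
    ecg_emb = F.normalize(ecg_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = ecg_emb @ text_emb.T / temperature
    # The i-th ECG is paired with the i-th text, so targets are the diagonal.
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

torch.manual_seed(0)
# Stand-in ECG encoder (hypothetical): 12 leads x 1000 samples -> 768-d embedding.
toy_encoder = nn.Sequential(
    nn.Conv1d(12, 16, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(16, 768),
)
# Only the ECG encoder's parameters are optimized; the text side stays frozen.
optimizer = torch.optim.Adam(toy_encoder.parameters())

signals = torch.randn(8, 12, 1000)   # toy batch of ECG signals
text_emb = torch.randn(8, 768)       # stand-in for frozen text embeddings

optimizer.zero_grad()
loss = clip_loss(toy_encoder(signals), text_emb)
loss.backward()
optimizer.step()
```

One caveat of the diagonal targets: when several recordings in a batch share the same diagnosis text, they are still treated as negatives of each other, which adds noise to the loss.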

### 1. 1D-CNN Model

In [69]:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

In [71]:
class OneDimCNN(nn.Module):
    def __init__(self, num_classes):
        super(OneDimCNN, self).__init__()

        # Layer 1 (in_channels must equal the number of leads in the input signal)
        self.conv1 = nn.Conv1d(in_channels=3, out_channels=32, kernel_size=3, stride=1, padding=1)
        self.bn1 = nn.BatchNorm1d(32)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.AvgPool1d(kernel_size=2, stride=2)

        # Layer 2
        self.conv2 = nn.Conv1d(in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1)
        self.bn2 = nn.BatchNorm1d(64)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.AvgPool1d(kernel_size=2, stride=2)

        # Layer 3
        self.conv3 = nn.Conv1d(in_channels=64, out_channels=128, kernel_size=3, stride=1, padding=1)
        self.bn3 = nn.BatchNorm1d(128)
        self.relu3 = nn.ReLU()
        self.pool3 = nn.AvgPool1d(kernel_size=2, stride=2)

        # Layer 4
        self.conv4 = nn.Conv1d(in_channels=128, out_channels=256, kernel_size=3, stride=1, padding=1)
        self.bn4 = nn.BatchNorm1d(256)
        self.relu4 = nn.ReLU()
        self.pool4 = nn.AvgPool1d(kernel_size=2, stride=2)

        # Layer 5
        self.conv5 = nn.Conv1d(in_channels=256, out_channels=512, kernel_size=3, stride=1, padding=1)
        self.bn5 = nn.BatchNorm1d(512)
        self.relu5 = nn.ReLU()
        self.pool5 = nn.AvgPool1d(kernel_size=2, stride=2)

        # Layer 6
        self.conv6 = nn.Conv1d(in_channels=512, out_channels=1024, kernel_size=3, stride=1, padding=1)
        self.bn6 = nn.BatchNorm1d(1024)
        self.relu6 = nn.ReLU()
        self.pool6 = nn.AvgPool1d(kernel_size=2, stride=2)

        # Layer 7
        self.conv7 = nn.Conv1d(in_channels=1024, out_channels=2048, kernel_size=3, stride=1, padding=1)
        self.bn7 = nn.BatchNorm1d(2048)
        self.relu7 = nn.ReLU()
        self.pool7 = nn.AvgPool1d(kernel_size=2, stride=2)

        # Layer 8
        self.conv8 = nn.Conv1d(in_channels=2048, out_channels=4096, kernel_size=3, stride=1, padding=1)
        self.bn8 = nn.BatchNorm1d(4096)
        self.relu8 = nn.ReLU()
        self.pool8 = nn.AvgPool1d(kernel_size=2, stride=2)

        # Layer 9
        self.conv9 = nn.Conv1d(in_channels=4096, out_channels=8192, kernel_size=3, stride=1, padding=1)
        self.bn9 = nn.BatchNorm1d(8192)
        self.relu9 = nn.ReLU()
        self.pool9 = nn.AvgPool1d(kernel_size=2, stride=2)

        # Layer 10
        self.conv10 = nn.Conv1d(in_channels=8192, out_channels=16384, kernel_size=3, stride=1, padding=1)
        self.bn10 = nn.BatchNorm1d(16384)
        self.relu10 = nn.ReLU()
        self.pool10 = nn.AvgPool1d(kernel_size=2, stride=2)

        # Layer 11
        self.conv11 = nn.Conv1d(in_channels=16384, out_channels=32768, kernel_size=3, stride=1, padding=1)
        self.bn11 = nn.BatchNorm1d(32768)
        self.relu11 = nn.ReLU()
        self.pool11 = nn.AvgPool1d(kernel_size=2, stride=2)

        # Layer 12
        self.conv12 = nn.Conv1d(in_channels=32768, out_channels=65536, kernel_size=3, stride=1, padding=1)
        self.bn12 = nn.BatchNorm1d(65536)
        self.relu12 = nn.ReLU()
        self.pool12 = nn.AvgPool1d(kernel_size=2, stride=2)

        # Fully connected layer 1
        self.fc = nn.Linear(65536, num_classes)
        self.relu13 = nn.ReLU()
        self.dropout = nn.Dropout(p=0.5)

        # Fully connected layer 2
        self.fc2 = nn.Linear(num_classes, num_classes)

    def forward(self, x):
        # Layer 1
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu1(out)
        out = self.pool1(out)

        # Layer 2
        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu2(out)
        out = self.pool2(out)

        # Layer 3
        out = self.conv3(out)
        out = self.bn3(out)
        out = self.relu3(out)
        out = self.pool3(out)

        # Layer 4
        out = self.conv4(out)
        out = self.bn4(out)
        out = self.relu4(out)
        out = self.pool4(out)

        # Layer 5
        out = self.conv5(out)
        out = self.bn5(out)
        out = self.relu5(out)
        out = self.pool5(out)

        # Layer 6
        out = self.conv6(out)
        out = self.bn6(out)
        out = self.relu6(out)
        out = self.pool6(out)

        # Layer 7
        out = self.conv7(out)
        out = self.bn7(out)
        out = self.relu7(out)
        out = self.pool7(out)

        # Layer 8
        out = self.conv8(out)
        out = self.bn8(out)
        out = self.relu8(out)
        out = self.pool8(out)

        # Layer 9
        out = self.conv9(out)
        out = self.bn9(out)
        out = self.relu9(out)
        out = self.pool9(out)

        # Layer 10
        out = self.conv10(out)
        out = self.bn10(out)
        out = self.relu10(out)
        out = self.pool10(out)

        # Layer 11
        out = self.conv11(out)
        out = self.bn11(out)
        out = self.relu11(out)
        out = self.pool11(out)

        # Layer 12
        out = self.conv12(out)
        out = self.bn12(out)
        out = self.relu12(out)
        out = self.pool12(out)

        # Flatten
        out = out.view(out.size(0), -1)

        # Fully connected layer 1
        out = self.fc(out)
        out = self.relu13(out)
        out = self.dropout(out)

        # Fully connected layer 2
        out = self.fc2(out)

        return out
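
A rough arithmetic sketch of what this architecture implies, using plain Python. It relies on two facts about the layers above: a `Conv1d` with kernel size 3, stride 1 and padding 1 keeps the sequence length, and each `AvgPool1d(kernel_size=2, stride=2)` floors the length divided by 2. The helper names and the example lengths of 5000 and 2500 samples are assumptions for illustration.

```python
def length_after_pools(length, num_pools=12):
    """Sequence length after `num_pools` halving pools (convs keep the length)."""
    for _ in range(num_pools):
        length = length // 2
    return length

def conv1d_params(in_ch, out_ch, kernel_size=3):
    """Weights plus biases of one Conv1d layer."""
    return in_ch * out_ch * kernel_size + out_ch

# Channel widths used above: 3 -> 32 -> 64 -> ... -> 65536
channels = [3] + [32 * 2 ** i for i in range(12)]
total_conv_params = sum(conv1d_params(a, b) for a, b in zip(channels, channels[1:]))

print(length_after_pools(5000))   # 1
print(length_after_pools(2500))   # 0
print(f"{total_conv_params:,} convolution parameters")
```

Since `self.fc` expects exactly 65536 input features, the flattened output must have length 1, which holds only for inputs of 4096 to 8191 samples. A computed length of 0 means the pooling stack runs out of samples, so shorter signals cannot pass through. The convolution stack alone has billions of parameters, most of them in the last few layers, which is worth keeping in mind for memory.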

In [72]:
class ECGEncoder(OneDimCNN):
    def __init__(self, num_classes):
        super(ECGEncoder, self).__init__(num_classes)
        # OneDimCNN outputs `num_classes` features; project them to 768 dimensions
        # to match the size of the ClinicalBERT text embeddings
        self.fc3 = nn.Linear(num_classes, 768)

    def forward(self, signal):
        embedding = super().forward(signal)  # Features from the parent OneDimCNN
        embedding = self.fc3(embedding)  # Project into the shared 768-dimensional space
        return embedding

In [74]:
# Define the number of classes
num_classes = 126  # Replace with the actual number of classes

# Create an instance of the model
ecg_encoder = ECGEncoder(num_classes)

# Convert the numpy array to a PyTorch tensor
input_data = torch.from_numpy(train_set[0][1]['val']).float()

# Add an extra dimension for the batch size
input_data = input_data.unsqueeze(0)

# Convert the model's weights to Float
ecg_encoder = ecg_encoder.float()

# Pass the data through the model
output = ecg_encoder(input_data)

print(output.size())