# Main Thesis Topic: “Zero-shot classification of ECG signals using a CLIP-like model”

**For example: Train on PTB-XL:**

- Text Encoder: ClinicalBERT (applied to the diagnosis text of each ECG signal to obtain the corresponding text embeddings)
- Image Encoder: 1D-CNN (encodes the ECG signal to obtain signal embeddings; it takes the place of CLIP's image encoder)

- Experiment A) Baseline: use only the class name as the text, for example “Myocardial Infarction”. Some classes are excluded from training; once training is complete, the CLIP-like model is tested on these held-out classes.
    - The text embeddings are obtained from ClinicalBERT, and the ECG encoder is trained with a contrastive loss.

- Experiment B) Same as Experiment A, but instead of testing on held-out classes of the same dataset, we test on other datasets containing different classes.
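
The contrastive training and zero-shot testing described in the experiments can be sketched as follows. This is a minimal NumPy illustration of a CLIP-style symmetric loss, not the project's training code: the function names, the `temperature` value of 0.07, and the assumption that row `i` of the ECG embeddings is paired with row `i` of the text embeddings are all illustrative choices.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def log_softmax(logits):
    """Numerically stable row-wise log-softmax."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def clip_contrastive_loss(ecg_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the ECG-to-text similarity matrix.

    Assumes ecg_emb[i] and text_emb[i] form a matching pair (hypothetical setup).
    """
    logits = l2_normalize(ecg_emb) @ l2_normalize(text_emb).T / temperature
    idx = np.arange(len(logits))
    loss_ecg_to_text = -log_softmax(logits)[idx, idx].mean()
    loss_text_to_ecg = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ecg_to_text + loss_text_to_ecg)


def zero_shot_predict(ecg_emb, class_text_emb):
    """Assign each ECG to the class whose text embedding is most similar."""
    sims = l2_normalize(ecg_emb) @ l2_normalize(class_text_emb).T
    return sims.argmax(axis=1)
```

For zero-shot testing, `class_text_emb` would hold one text embedding per held-out class name, so no ECG from those classes is needed at training time. When only class names are used as text, several ECGs in a batch can share the same text, which the diagonal-target formulation above treats as negatives; this is a known limitation of the plain loss.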

**Evaluation metrics:** 
- Main: AUC-ROC, average_precision_score
- Optional: Specificity, Sensitivity, F1-score 
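
The optional, threshold-dependent metrics can be computed per class from confusion counts. The sketch below assumes binary ground-truth labels and predicted scores for a single class; `binary_metrics` and the default `threshold` of 0.5 are hypothetical choices for illustration. The threshold-free main metrics are available in scikit-learn as `roc_auc_score` and `average_precision_score`.

```python
import numpy as np


def binary_metrics(y_true, y_score, threshold=0.5):
    """Sensitivity, specificity and F1 for one class at a fixed score threshold."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_score) >= threshold

    tp = int(np.sum(y_true & y_pred))    # true positives
    tn = int(np.sum(~y_true & ~y_pred))  # true negatives
    fp = int(np.sum(~y_true & y_pred))   # false positives
    fn = int(np.sum(y_true & ~y_pred))   # false negatives

    def ratio(num, den):
        # Return NaN instead of raising when a class has no relevant samples
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }
```

In a multi-label setting, this would be applied to each class column separately and then averaged.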

**Outcome:** 
- It is possible to train CLIP-like models with a frozen text encoder (i.e. unchanged, not fine-tuned for downstream tasks).
- Training ECG encoders that are viable for representing different domains (within ECG modality) and previously unseen classes. 
- Training a CLIP-like model on ECGs has little novelty.

# Multi-Class ECG Classifier

First, we preprocess the ECG data from the PhysioNet 2021 challenge dataset. This data will be loaded using the ```PhysioNetDataset``` class.
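
Each recording in the challenge data comes with a WFDB-style header file that holds the metadata. The sketch below shows one way such a header could be parsed into a record dictionary with a `leads_info` list. It is an assumption about what `PhysioNetDataset` might do internally, not its actual code: `parse_header`, the dictionary keys, and the sample header text are all hypothetical.

```python
def parse_header(header_text):
    """Parse a WFDB-style header string into a flat record dictionary."""
    lines = [ln.strip() for ln in header_text.strip().splitlines() if ln.strip()]

    # First line: record name, number of leads, sampling frequency, sample count
    name, n_leads, fs, n_samples = lines[0].split()[:4]
    record = {
        "record_name": name,
        "n_leads": int(n_leads),
        "fs": float(fs),
        "n_samples": int(n_samples),
        "leads_info": [],
    }

    # One line per lead: signal file, format, gain, ..., lead name (last field)
    for line in lines[1:1 + record["n_leads"]]:
        parts = line.split()
        record["leads_info"].append(
            {"file": parts[0], "gain": parts[2], "lead_name": parts[-1]}
        )

    # Remaining '#Key: value' comment lines carry the labels and demographics
    for line in lines[1 + record["n_leads"]:]:
        if line.startswith("#") and ":" in line:
            key, value = line[1:].split(":", 1)
            record[key.strip().lower()] = value.strip()
    return record


# Made-up two-lead header used only to exercise the parser
sample_header = """
REC0001 2 500 5000
REC0001.mat 16+24 1000/mV 16 0 -254 21756 0 I
REC0001.mat 16+24 1000/mV 16 0 264 -599 0 II
#Age: 53
#Sex: Male
#Dx: 426783006
"""
record = parse_header(sample_header)
print(record["record_name"], record["n_leads"], record["dx"])
```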

In [23]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sys
from tqdm import tqdm
from scipy.signal import resample
import torch
from transformers import AutoTokenizer, AutoModel
import ast
import scipy.io as sio
from torch.utils.data import random_split

In [2]:
sys.path.append('C:/Users/navme/Desktop/ECG_Project/PyFiles')

In [3]:
from helper_functions import *
from dataset import *

In [4]:
# Path to dir/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training
PhysioNet_PATH = 'C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training'
PhysioNet_PATH

'C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training'

Using the ```PhysioNet_PATH```, we can create separate datasets for training, testing & validation.

# Data Preprocessing 

- train_set (train & validation data)
- test_set (test data)

In [8]:
train_set = PhysioNetDataset(PhysioNet_PATH, train=True)
test_set = PhysioNetDataset(PhysioNet_PATH, train=False)

len(train_set), len(test_set)

(65900, 22352)

The ```train_set``` can be split into ```current_train``` and ```current_val```. 

In [31]:
# Set the seed for the random number generator
torch.manual_seed(0)

# Get the length of the train_set
length = len(train_set)

# Calculate the lengths of the splits
train_length = int(0.85 * length)
val_length = length - train_length

# Split the dataset
current_train, current_val = random_split(train_set, [train_length, val_length])
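
To make the split sizes and indexing concrete, here is a pure-Python sketch of what an 85/15 random split does. It is not `torch.utils.data.random_split` itself; `split_indices` is a hypothetical helper, and its shuffle order will differ from PyTorch's. The point it illustrates is that position `i` in a subset maps to a shuffled index in the parent dataset, so each subset should be indexed directly rather than through the parent.

```python
import random


def split_indices(n_total, train_fraction=0.85, seed=0):
    """Return disjoint, shuffled index lists for the train and validation parts."""
    train_length = int(train_fraction * n_total)
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    return indices[:train_length], indices[train_length:]


# 65900 is the size of train_set reported above
train_idx, val_idx = split_indices(65900)
print(len(train_idx), len(val_idx))  # 56015 9885
```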

The next step is to extract the header data for ```current_train```, ```current_val```, and ```test_set``` and save the data to a csv file. 
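
Because the same flattening logic is needed for every split, it could be factored into a small helper. The sketch below is one possible version under the assumption that each record is a dictionary with a `leads_info` list of per-lead dictionaries; `flatten_record` and the example keys are hypothetical. Unlike an in-place update, it returns a new dictionary and leaves the input record untouched.

```python
def flatten_record(record):
    """Return a flat copy of `record` with one column per lead attribute."""
    # Copy every top-level field except the nested list
    flat = {key: value for key, value in record.items() if key != "leads_info"}

    # Expand each lead's attributes into 'lead_<index>_<attribute>' columns
    for j, lead_info in enumerate(record.get("leads_info", [])):
        for key, value in lead_info.items():
            flat[f"lead_{j}_{key}"] = value
    return flat


# Made-up record used only to show the resulting column names
example = {
    "record_name": "example",
    "leads_info": [
        {"lead_name": "I", "gain": "1000/mV"},
        {"lead_name": "II", "gain": "1000/mV"},
    ],
}
flat_example = flatten_record(example)
print(sorted(flat_example))
```

A list of such flat dictionaries can be passed straight to `pd.DataFrame`, as in the cells below.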

## current_train

In [32]:
# Initialize an empty list to store the records
records = []

# Iterate over all records
for i in tqdm(range(len(current_train)), desc="Processing records"):
    record, _ = current_train[i]  # Index the split subset, not the full train_set (ignore the ECG data for now)
    
    # Flatten the 'leads_info' list into separate columns for each lead
    for j, lead_info in enumerate(record['leads_info']):
        for key, value in lead_info.items():
            record[f'lead_{j}_{key}'] = value
    del record['leads_info']  # We don't need the 'leads_info' list anymore

    # Append the record to the list
    records.append(record)

# Convert the list of records into a DataFrame
df = pd.DataFrame(records)

# Save the DataFrame to a CSV file
df.to_csv('train_set_records.csv', index=False)

print(f"Processed {len(records)} records.")

Processing records: 100%|██████████| 56015/56015 [14:35<00:00, 63.96it/s] 


Processed 56015 records.


## current_val

In [None]:
# Initialize an empty list to store the records
records = []

# Iterate over all records
for i in tqdm(range(len(current_val)), desc="Processing records"):
    record, _ = current_val[i]  # Index the split subset, not the full train_set (ignore the ECG data for now)
    
    # Flatten the 'leads_info' list into separate columns for each lead
    for j, lead_info in enumerate(record['leads_info']):
        for key, value in lead_info.items():
            record[f'lead_{j}_{key}'] = value
    del record['leads_info']  # We don't need the 'leads_info' list anymore

    # Append the record to the list
    records.append(record)

# Convert the list of records into a DataFrame
df = pd.DataFrame(records)

# Save the DataFrame to its own CSV file so the training CSV is not overwritten
df.to_csv('val_set_records.csv', index=False)

print(f"Processed {len(records)} records.")

## test_set