## Dataset List

1. PhysioNet 2021 Challenge

```
    Data Sources: 
    
    CPSC Database and CPSC-Extra Database
    INCART Database
    PTB and PTB-XL Database
    The Georgia 12-lead ECG Challenge (G12EC) Database
    Augmented Undisclosed Database
    Chapman-Shaoxing and Ningbo Database
    The University of Michigan (UMich) Database
```

## PhysioNet 2021 Challenge

The training data contain twelve-lead ECGs. The validation and test data contain twelve-lead, six-lead, four-lead, three-lead, and two-lead ECGs:

1. Twelve leads: I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, V6
2. Six leads: I, II, III, aVR, aVL, aVF
3. Four leads: I, II, III, V2
4. Three leads: I, II, V2
5. Two leads: I, II
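Since the training data are twelve-lead only, the reduced lead sets have to be derived from the full recordings. The following is a minimal sketch of that selection, assuming a recording is held as a NumPy array of shape `(12, n_samples)` with rows in the twelve-lead order listed above. The names `LEAD_SETS` and `select_leads` are illustrative and not part of any Challenge code.

```python
import numpy as np

# Lead order of a twelve-lead recording, as listed above.
TWELVE_LEADS = ("I", "II", "III", "aVR", "aVL", "aVF",
                "V1", "V2", "V3", "V4", "V5", "V6")

# Reduced lead sets used in the validation and test data.
LEAD_SETS = {
    12: TWELVE_LEADS,
    6: ("I", "II", "III", "aVR", "aVL", "aVF"),
    4: ("I", "II", "III", "V2"),
    3: ("I", "II", "V2"),
    2: ("I", "II"),
}


def select_leads(signal, num_leads):
    """Return the rows of a (12, n_samples) array that belong to a reduced lead set."""
    rows = [TWELVE_LEADS.index(name) for name in LEAD_SETS[num_leads]]
    return signal[rows, :]


# Dummy recording: row i is filled with the value i so the selection is easy to check.
dummy = np.tile(np.arange(12).reshape(12, 1), (1, 5))
three_lead = select_leads(dummy, 3)
print(three_lead.shape)      # (3, 5)
print(three_lead[:, 0])      # rows 0, 1 and 7, i.e. leads I, II and V2
```

Looking leads up by name rather than hard-coding row indices keeps the selection correct if a recording stores its leads in a different order, provided `TWELVE_LEADS` is replaced by the order read from that recording's header.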

Each ECG recording has one or more labels that describe cardiac abnormalities (and/or a normal sinus rhythm).
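Because a recording can carry more than one label, the targets are naturally represented as a multi-hot vector rather than a single class index. The sketch below assumes that multiple diagnoses appear as a comma-separated list of codes in the `#Dx:` header field, and it uses a hypothetical three-class list. Only 426783006 (sinus rhythm) is taken from this document; the other two codes are placeholders for illustration.

```python
def parse_dx(dx_field):
    """Split a '#Dx:' value such as '426783006,59118001' into a list of code strings."""
    return [code.strip() for code in dx_field.split(",") if code.strip()]


def multi_hot(codes, classes):
    """Return a 0/1 list with one entry per class, set to 1 where the code is present."""
    present = set(codes)
    return [1 if c in present else 0 for c in classes]


# Hypothetical class list (assumption): only 426783006 appears in this document.
CLASSES = ["164889003", "426783006", "59118001"]

labels = multi_hot(parse_dx("426783006, 59118001"), CLASSES)
print(labels)  # [0, 1, 1]
```

Codes that are not in `CLASSES` are silently ignored here; whether that is the right behavior depends on how the classes to be scored are chosen.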

The Challenge data include annotated twelve-lead ECG recordings from six sources in four countries across three continents. These databases include over 100,000 twelve-lead ECG recordings with over 88,000 ECGs shared publicly as training data.

For example, a header file A0001.hea may have the following contents:

```
    A0001 12 500 7500
    A0001.mat 16+24 1000/mV 16 0 28 -1716 0 I
    A0001.mat 16+24 1000/mV 16 0 7 2029 0 II
    A0001.mat 16+24 1000/mV 16 0 -21 3745 0 III
    A0001.mat 16+24 1000/mV 16 0 -17 3680 0 aVR
    A0001.mat 16+24 1000/mV 16 0 24 -2664 0 aVL
    A0001.mat 16+24 1000/mV 16 0 -7 -1499 0 aVF
    A0001.mat 16+24 1000/mV 16 0 -290 390 0 V1
    A0001.mat 16+24 1000/mV 16 0 -204 157 0 V2
    A0001.mat 16+24 1000/mV 16 0 -96 -2555 0 V3
    A0001.mat 16+24 1000/mV 16 0 -112 49 0 V4
    A0001.mat 16+24 1000/mV 16 0 -596 -321 0 V5
    A0001.mat 16+24 1000/mV 16 0 -16 -3112 0 V6
    #Age: 74
    #Sex: Male
    #Dx: 426783006
    #Rx: Unknown
    #Hx: Unknown
    #Sx: Unknown
```

Reading this header:
- From the first line, we see that the record name is A0001, that the recording has 12 leads, each sampled at 500 Hz, and that each lead contains 7500 samples (15 seconds of signal).
- From the next 12 lines (one per lead), we see that each signal:
    - Is stored in the file A0001.mat in WFDB format 16 (16-bit samples), with an offset of 24 bytes before the first sample.
    - Has an ADC gain (analog-to-digital converter units per physical unit) of 1000/mV, so a stored value of 1000 corresponds to 1 mV.
    - Was digitized with an ADC resolution of 16 bits, and the baseline value corresponding to 0 physical units is 0.
    - Ends with the first value of the signal (28, etc.), the checksum (-1716, etc.), the block size (0), and the lead name (I, etc.).
- From the final 6 lines, we see that:
    - The patient is a 74-year-old male.
    - The diagnosis (Dx) is 426783006, which is the **SNOMED-CT code** for sinus rhythm.
    - The medical prescription (Rx), history (Hx), and symptom or surgery (Sx) are unknown.

See the WFDB header format documentation for more information on the header file and its fields.
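The walkthrough above can be turned into a small parser. This is a minimal sketch in plain Python, not a replacement for the `wfdb` package: it reads only the record line, the gain and lead name of each signal line, and the `#Key: value` comment lines. The function name `parse_header` is illustrative, and the embedded example is the header shown above shortened to two leads for brevity.

```python
def parse_header(text):
    """Parse the record line, per-lead gain and name, and '#Key: value' comments."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    record, n_leads, fs, n_samples = lines[0].split()[:4]
    info = {
        "record": record,
        "n_leads": int(n_leads),
        "fs": int(fs),
        "n_samples": int(n_samples),
        "leads": [],
        "gains": [],
        "meta": {},
    }
    # One signal line per lead: the gain is the third field, the lead name the last.
    for ln in lines[1:1 + info["n_leads"]]:
        fields = ln.split()
        gain_text = fields[2].split("/")[0].split("(")[0]  # '1000/mV' -> '1000'
        info["gains"].append(float(gain_text))
        info["leads"].append(fields[-1])
    # Remaining lines are comments such as '#Age: 74'.
    for ln in lines[1 + info["n_leads"]:]:
        if ln.startswith("#") and ":" in ln:
            key, value = ln[1:].split(":", 1)
            info["meta"][key.strip()] = value.strip()
    return info


# The example header from above, shortened to two leads (assumption for brevity).
EXAMPLE = """
A0001 2 500 7500
A0001.mat 16+24 1000/mV 16 0 28 -1716 0 I
A0001.mat 16+24 1000/mV 16 0 7 2029 0 II
#Age: 74
#Sex: Male
#Dx: 426783006
"""

header = parse_header(EXAMPLE)
duration_s = header["n_samples"] / header["fs"]
print(header["leads"], header["meta"]["Dx"], duration_s)
```

With the gain in hand, a stored integer sample converts to millivolts as `value / gain`, so a sample of 28 on lead I corresponds to 0.028 mV.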

## Data Source Information

1. CPSC Database and CPSC-Extra Database

- Together, these databases contain 13,256 ECGs (10,330 ECGs shared as training data, 1,463 retained as validation data, and 1,463 retained as test data).
- Each recording is between 6 and 144 seconds long with a sampling frequency of 500 Hz.
- Per HIPAA guidelines, ages over 89 are not provided for these datasets.
- cpsc_2018, 6,877 recordings
- cpsc_2018_extra, 3,453 recordings

2. INCART (st_petersburg_incart) Database

- This source contains 74 annotated ECGs (all shared as training data) extracted from 32 Holter monitor recordings. 
- Each recording is 30 minutes long with a sampling frequency of 257 Hz.
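INCART is sampled at 257 Hz while most of the other sources use 500 Hz, so a model that expects a fixed sampling rate needs the recordings resampled first. The following is a minimal sketch using linear interpolation with NumPy. It is an illustrative choice rather than the method used by the Challenge organizers, and `resample_linear` is a hypothetical helper name.

```python
import numpy as np


def resample_linear(signal, fs_in, fs_out):
    """Resample a (n_leads, n_samples) array from fs_in to fs_out by linear interpolation."""
    n_in = signal.shape[1]
    n_out = int(round(n_in / fs_in * fs_out))
    t_in = np.arange(n_in) / fs_in     # original sample times in seconds
    t_out = np.arange(n_out) / fs_out  # target sample times in seconds
    return np.stack([np.interp(t_out, t_in, lead) for lead in signal])


# Synthetic two-lead signal: 10 seconds at 257 Hz.
fs_in, fs_out = 257, 500
t = np.arange(10 * fs_in) / fs_in
ecg_like = np.vstack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])

resampled = resample_linear(ecg_like, fs_in, fs_out)
print(ecg_like.shape, "->", resampled.shape)  # (2, 2570) -> (2, 5000)
```

Linear interpolation is adequate for a quick look when upsampling. When downsampling (for example from 1,000 Hz to 500 Hz), an anti-aliasing resampler such as `scipy.signal.resample_poly` is the safer choice.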

3. PTB and PTB-XL Database

- The source contains 22,353 ECGs (all shared as training data). 
- Each recording is between 10 and 120 seconds long with a sampling frequency of either 500 or 1,000 Hz.
- PTB, 516 recordings
- PTB-XL, 21,837 recordings

4. Georgia 12-lead ECG Challenge (G12EC) Database

- This source contains 20,672 ECGs (10,344 ECGs shared as training data, 5,167 retained as validation data, and 5,161 retained as test data). 
- Each recording is between 5 and 10 seconds long with a sampling frequency of 500 Hz.

5. Shaoxing People’s Hospital (Chapman-Shaoxing) and Ningbo First Hospital (Ningbo) Database  

- This source contains 45,152 ECGs (all shared as training data).
- Each recording is 10 seconds long with a sampling frequency of 500 Hz.
- chapman_shaoxing, 10,247 recordings
- ningbo, 34,905 recordings
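As a quick consistency check, the publicly shared training counts listed for the five sources can be tallied and compared against the "over 88,000" figure quoted earlier. The dictionary below simply copies the numbers from the lists above; the keys are labels chosen for this sketch.

```python
# Publicly shared training recordings, copied from the counts listed above.
TRAINING_COUNTS = {
    "cpsc_2018": 6877,
    "cpsc_2018_extra": 3453,
    "INCART": 74,
    "PTB": 516,
    "PTB-XL": 21837,
    "G12EC": 10344,
    "chapman_shaoxing": 10247,
    "ningbo": 34905,
}

total = sum(TRAINING_COUNTS.values())
print(f"total training recordings: {total:,}")  # total training recordings: 88,253

# Share of the training set contributed by each source, largest first.
for name, count in sorted(TRAINING_COUNTS.items(), key=lambda kv: -kv[1]):
    print(f"{name:>18}: {count:>6,} ({count / total:.1%})")
```

The tally also shows how unbalanced the sources are: the two largest (ningbo and PTB-XL) together account for well over half of the training recordings, which is worth keeping in mind when splitting data for validation.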

NOTE: Under each dataset folder, the files are grouped into subfolders of up to 1,000 records each. The subfolders are named g1, g2, and so on: once a folder holds 1,000 records, a new folder is started with the number incremented by one.
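The grouping rule in the note can be sketched in a few lines. `subfolder_for` maps a zero-based record position to its g# folder, and `list_headers` gathers the header files across all g# subfolders of one dataset folder. Both names are hypothetical, and the ordering of records that determines the position is an assumption, since it is not stated here. The demonstration builds a throwaway directory tree rather than touching the real data.

```python
import tempfile
from pathlib import Path


def subfolder_for(record_index, per_folder=1000):
    """Name of the g# subfolder holding the record at a zero-based position."""
    return f"g{record_index // per_folder + 1}"


def list_headers(dataset_dir):
    """Collect every .hea file found in the g# subfolders of one dataset folder."""
    return sorted(Path(dataset_dir).glob("g*/*.hea"))


# Demonstration on a temporary tree with hypothetical record names.
with tempfile.TemporaryDirectory() as tmp:
    for folder, record in [("g1", "A0001"), ("g1", "A0002"), ("g2", "A1001")]:
        (Path(tmp) / folder).mkdir(exist_ok=True)
        (Path(tmp) / folder / f"{record}.hea").touch()
    (Path(tmp) / "index.html").touch()  # non-record file, skipped by the glob
    found = [p.name for p in list_headers(tmp)]

print(subfolder_for(0), subfolder_for(999), subfolder_for(1000))  # g1 g1 g2
print(found)  # ['A0001.hea', 'A0002.hea', 'A1001.hea']
```

For ningbo's 34,905 records, `subfolder_for(34904)` gives `g35`, which matches the 35 subfolders that dataset contains.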

## Data Exploration & Examples

In [None]:
%pip install wfdb

In [6]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import torch

# Matlab/WFDB files
import scipy.io as sio
import wfdb

# helper_functions.py
import helper_functions as hf

In [2]:
PhysioNet_PATH = 'C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training'
PhysioNet_PATH

'C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training'

In [19]:
# Get all the file paths in the PhysioNet directory
physionet_file_PATHS = []

In [20]:
for file in os.listdir(PhysioNet_PATH):
    file_path = os.path.join(PhysioNet_PATH, file)
    file_path = file_path.replace('\\', '/')
    physionet_file_PATHS.append(file_path)

In [21]:
physionet_file_PATHS

['C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training/chapman_shaoxing',
 'C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training/cpsc_2018',
 'C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training/cpsc_2018_extra',
 'C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training/georgia',
 'C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training/index.html',
 'C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training/ningbo',
 'C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challenge-2021/1.0.3/training/ptb',
 'C:/Users/navme/Desktop/ECG_Thesis_Local/PhysioNet-2021-Challenge/physionet.org/files/challen

In [24]:
os.listdir(physionet_file_PATHS[5])

['g1',
 'g10',
 'g11',
 'g12',
 'g13',
 'g14',
 'g15',
 'g16',
 'g17',
 'g18',
 'g19',
 'g2',
 'g20',
 'g21',
 'g22',
 'g23',
 'g24',
 'g25',
 'g26',
 'g27',
 'g28',
 'g29',
 'g3',
 'g30',
 'g31',
 'g32',
 'g33',
 'g34',
 'g35',
 'g4',
 'g5',
 'g6',
 'g7',
 'g8',
 'g9',
 'index.html']
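The listing above is in lexicographic order (`g1`, `g10`, `g11`, ..., `g2`) and also contains `index.html`, which is not a record folder. A minimal sketch of one way to keep only the g# folders and put them in numeric order follows; `natural_key` is an illustrative helper, and the short `entries` list stands in for the output of `os.listdir`.

```python
import re


def natural_key(name):
    """Sort key that compares the digit runs in a name as integers."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", name)]


# Stand-in for os.listdir(...) output, in the lexicographic order shown above.
entries = ["g1", "g10", "g11", "g2", "g3", "index.html"]

# Keep only names of the form g<number>, then sort them numerically.
folders = sorted((e for e in entries if re.fullmatch(r"g\d+", e)), key=natural_key)
print(folders)  # ['g1', 'g2', 'g3', 'g10', 'g11']
```

Numeric ordering is not required for loading the data, but it makes the folder order reproducible and easier to reason about when records are indexed by position.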

In [None]:
class PhysioNetDataset(torch.utils.data.Dataset):
    def __init__(self, dataset_path, train=False):
        self.dataset_path = dataset_path
        self.train = train
        self.file_list = os.listdir(dataset_path)
        # Build a forward-slash path for every entry in the dataset directory
        self.file_PATHS = []
        for file in self.file_list:
            file_path = os.path.join(dataset_path, file)
            file_path = file_path.replace('\\', '/')
            self.file_PATHS.append(file_path)