# Original Database formatting
Use this Notebook for initial formatting of the database.

Checklist:
- Remove patients without ECG signal from data description file (done)
- Upload (and format) all ECG signals (done)
- Modularise code (done)
- Write checks in the functions: they should detect errors in the metadata and flag them
- Ensure all of the metadata information is in a standard form

For more information on the pyECG module: 
https://www.researchgate.net/publication/331012096_PyECG_A_software_tool_for_the_analysis_of_the_QT_interval_in_the_electrocardiogram

https://pypi.org/project/pyECG/

In [2]:
# Importing packages
import os
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pyecg import ECGRecord
import json
import pyarrow.feather as feather

## WFDB file formats
### Header files
The record line
E.g. S0250ECG 8 1000 1256376 15:26:59 2006
- Record name: S0250ECG
- Number of signals: 8
- Sampling frequency (in samples per second per signal): 1000
- Number of samples per signal: 1256376
- Base time (time of day corresponding with the beginning of the record): 15:26:59
- Base date: 2006 (WFDB expects DD/MM/YYYY, so this field is in the wrong format in the database; a quick fix is to remove it from the header, but this may be tedious)
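
As a rough illustration of the fields listed above, the record line can be split on whitespace and given names. This is a minimal sketch, not part of the pipeline: `RecordLine`, `parse_record_line` and `date_is_valid` are hypothetical names, and it assumes a plain record line without the optional sub-fields (segment counts, counter frequency) that the WFDB format allows.

```python
import re
from typing import NamedTuple, Optional


class RecordLine(NamedTuple):
    name: str
    n_signals: int
    fs: float                  # Samples per second per signal
    n_samples: int             # Samples per signal
    base_time: Optional[str]   # HH:MM:SS, if present
    base_date: Optional[str]   # Should be DD/MM/YYYY, if present


def parse_record_line(line: str) -> RecordLine:
    """Split a plain WFDB record line into named fields (hypothetical helper)."""
    parts = line.split()
    return RecordLine(
        name=parts[0],
        n_signals=int(parts[1]),
        fs=float(parts[2]),
        n_samples=int(parts[3]),
        base_time=parts[4] if len(parts) > 4 else None,
        base_date=parts[5] if len(parts) > 5 else None,
    )


def date_is_valid(base_date: Optional[str]) -> bool:
    """True if the base date is absent or looks like DD/MM/YYYY."""
    return base_date is None or re.fullmatch(r"\d{2}/\d{2}/\d{4}", base_date) is not None


rec = parse_record_line("S0250ECG 8 1000 1256376 15:26:59 2006")
duration_s = rec.n_samples / rec.fs  # Length of the recording in seconds
print(rec.name, duration_s, date_is_valid(rec.base_date))
```

For the example line this reports a recording of roughly 1256 seconds and flags the base date as malformed.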

Signal specification lines
Each non-empty, non-comment line following the record line describes one signal.
E.g. S0250ECG.dat 16 1(0)/uV 16 0 -88 25184 0 ecg_0
- File name (of where the signal is stored): S0250ECG.dat
- Format: 16 (16-bit amplitudes, see documentation on signal files)
- ADC gain: 1 (ADC units per uV)
- Baseline: 0 (the ADC value corresponding to 0 uV)
- ADC resolution (bits): 16
- ADC zero: 0
- Initial value (of signal): -88
- Checksum (used to verify that the file hasn't been corrupted): 25184
- Block size: 0
- Description: 'ecg_0'
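
The signal specification line can be unpacked in a similar way. The sketch below is only an illustration: `GAIN_FIELD`, `parse_signal_line` and `to_physical` are hypothetical names, and it assumes every field up to the block size is present and that the format field needs no further parsing. The defaults for a missing baseline (ADC zero) and missing units (mV) follow the WFDB header documentation linked below.

```python
import re

# ADC gain field: gain, optional "(baseline)", optional "/units", e.g. "1(0)/uV"
GAIN_FIELD = re.compile(
    r"(?P<gain>\d*\.?\d+(?:[eE][-+]?\d+)?)"
    r"(?:\((?P<baseline>-?\d+)\))?"
    r"(?:/(?P<units>\S+))?"
)


def parse_signal_line(line):
    """Split one signal specification line into a dict (hypothetical helper)."""
    parts = line.split(maxsplit=8)  # The description may contain spaces
    if len(parts) < 8:
        raise ValueError("expected at least 8 fields, got %d" % len(parts))
    match = GAIN_FIELD.fullmatch(parts[2])
    if match is None:
        raise ValueError("unrecognised ADC gain field: " + parts[2])
    adc_zero = int(parts[4])
    baseline = match["baseline"]
    return {
        "file_name": parts[0],
        "format": parts[1],
        "adc_gain": float(match["gain"]),
        # Baseline defaults to ADC zero and units to mV when omitted
        "baseline": int(baseline) if baseline is not None else adc_zero,
        "units": match["units"] or "mV",
        "adc_resolution": int(parts[3]),
        "adc_zero": adc_zero,
        "initial_value": int(parts[5]),
        "checksum": int(parts[6]),
        "block_size": int(parts[7]),
        "description": parts[8] if len(parts) > 8 else "",
    }


def to_physical(sample, sig):
    """Convert a raw ADC sample to physical units: (sample - baseline) / gain."""
    return (sample - sig["baseline"]) / sig["adc_gain"]


sig = parse_signal_line("S0250ECG.dat 16 1(0)/uV 16 0 -88 25184 0 ecg_0")
print(sig["description"], sig["units"], to_physical(sig["initial_value"], sig))
```

With a gain of 1 and a baseline of 0, the initial value of -88 ADC units corresponds to -88 uV.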

For more information on header files: https://archive.physionet.org/physiotools/wag/header-5.htm

### Signal files...
Info can be found here: https://archive.physionet.org/physiotools/wag/signal-5.htm

## Creating Metadata Files


In [2]:
# Setting root directory of database to source data from
# Raw strings so that the backslashes are not treated as escape sequences
root = r'D:\Molecool\Databases\Database1'
droot = r'D:\Molecool\og_database'

In [3]:
# The variables of interest in this database
general = ['Group', 'Diabetes Duration','age','BMI','Hb A1C%','CRP (mg/L)','Neuropathy AUTONOMIC SYMPTOMS','WBC K/uL','RBC m/uL','Hgb g/dL','GLUCOSE mg/dL','URINE CREAT mg/dL','URINE ALBUMIN mg/dL', 'CHOLESTmg/dL','LDL CALCmg/dL','Retinopathy Grading']
small = ['Group', 'Diabetes Duration','age','BMI','Hb A1C%','CRP (mg/L)','Neuropathy AUTONOMIC SYMPTOMS']

In [29]:
# Creating the large metadata file
# List of patients with ECG readings 
df_csv = pd.read_csv((droot + '/data_description/GE-75_files_per_subject.csv'), encoding = 'latin-1')
df_csv.set_index('Subject ID', inplace=True)
df_csv.index = df_csv.index.str.upper() # Patient IDs not uniformly entered
df_csv = df_csv.iloc[: , 1:-3] #Drop last 3 columns as well as the group column
df_csv = df_csv.loc[(df_csv != 0).any(axis=1)] # Keep patients with data associated

# Create folders for all of the patients with data
for sub in df_csv.index:
    path = os.path.join(root, sub)
    os.makedirs(path, exist_ok=True) # Does not fail if the cell is re-run

# Importing all other patient variables into a dataframe
df_meta = pd.read_csv((droot + '/data_description/GE-75_data_summary_table.csv'), encoding = 'latin-1')
df_meta.set_index('patient ID', inplace=True)
df_meta.index = df_meta.index.str.upper()
df_meta = df_meta[general] #Only taking variables of interest
# Combine the two dataframes, keeping only the patients with data associated
df_meta = pd.concat([df_meta, df_csv], axis=1).reindex(df_csv.index)

# # 8 Controls and 46 Diabetics from the 88 initial total
# # Now saving this as a new file for future use
df_meta.to_json((root + '/LMeta.json'), orient='index')

In [None]:
# Checking that the data has been saved correctly
# Open the JSON file written above and load it as a dictionary
with open(root + '/LMeta.json') as f:
    data = json.load(f)

# Pretty-print the contents
print(json.dumps(data, indent=4))

In [30]:
# # Creating and saving small metadata file
df_meta = df_meta[small]
df_meta.to_json((root + '/SMeta.json'), orient='index')
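
With `orient='index'`, `to_json` writes one top-level key per index entry (here the patient ID), each mapping to an object of column names and values. The sketch below shows how a file with that shape could be queried using only the `json` module. The patient IDs, group labels and values are made up for illustration and are not taken from the database, and `patients_in_group` is a hypothetical helper.

```python
import json

# Synthetic stand-in for the contents of SMeta.json (all values are made up)
example = {
    "S0001": {"Group": "control", "age": 60, "BMI": 24.0},
    "S0002": {"Group": "DM", "age": 71, "BMI": 29.5},
    "S0003": {"Group": "DM", "age": 66, "BMI": 31.2},
}


def patients_in_group(meta, group):
    """Return the sorted patient IDs whose 'Group' field equals `group`."""
    return sorted(pid for pid, fields in meta.items() if fields.get("Group") == group)


# Round-trip through a JSON string, as would happen when reading the file back
meta = json.loads(json.dumps(example))
print(patients_in_group(meta, "DM"))
```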

## Loading the data into feather files

In [61]:
def create_metadata(header, path):
    """Create the patient's metadata file from an open WFDB header.

    Saves the number of samples per signal, the sampling rate and, if present,
    the time of day the recording started to Meta.json in `path`.
    """
    f_line = header.readline().split()  # Record line of the header
    d = {'Length of reading': f_line[3]}
    if len(f_line) >= 5:  # The start time is optional in the record line
        d['Start time'] = f_line[4]
    d.update({'Sampling rate': f_line[2], 'Error Flag': False, 'Error Type': 'No error'})
    #os.mkdir(path)
    with open(os.path.join(path, 'Meta.json'), "w") as outfile:
        json.dump(d, outfile)
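
The 'Error Flag' and 'Error Type' entries are currently always written as "no error". One possible way to fill them in, following the checklist item on metadata checks, is sketched below. `check_record_line` is a hypothetical helper, the error messages are placeholders, and the checks assume a plain numeric sampling rate and sample count in the record line.

```python
def _is_positive(text, cast):
    """True if `text` converts with `cast` to a value greater than zero."""
    try:
        return cast(text) > 0
    except ValueError:
        return False


def check_record_line(f_line):
    """Return (error_flag, error_type) for a split WFDB record line."""
    if len(f_line) < 4:
        return True, 'Record line has fewer than 4 fields'
    if not _is_positive(f_line[2], float):
        return True, 'Invalid sampling rate'
    if not _is_positive(f_line[3], int):
        return True, 'Invalid number of samples'
    return False, 'No error'


good = "S0250ECG 8 1000 1256376 15:26:59 2006".split()
bad = "S0250ECG 8 1000".split()
print(check_record_line(good))
print(check_record_line(bad))
```

The returned pair could be stored under the existing 'Error Flag' and 'Error Type' keys instead of the fixed values.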

In [39]:
def create_feather(header_path, lead_names, path):
    """Create Feather file using header file to extract signal from .DAT file"""
    df = pd.DataFrame()
    record = ECGRecord.from_wfdb(header_path)
    
    #Using pyECG Library
    for lead in lead_names:
        signal = record.get_lead(lead)
        df[lead] = pd.Series(signal)
            
    feather.write_feather(df, (path + '/ECG.ftr'))

In [62]:
def read_samples(dpath, ecg_lead_names, folder_name, just_metadata):
    """Cycle through the header files in the folder specified (dpath) and create a JSON
    metadata file and, unless just_metadata is True, a feather signal file for each.
    Save these in the folder associated with the patient."""
    files = sorted(os.listdir(dpath))

    for file in files: #Cycle through files in the database
        if file.endswith('.hea'):
            hea_path = dpath + file #dpath must end with a path separator
            pat_name = file[:5].upper() #Patient ID is the first five characters
            print('Reading data for subject ' + pat_name)
            path = root + '\\' + pat_name + '\\' + folder_name #Patient folder
            #os.mkdir(path)
            with open(hea_path, "r") as header: #Header is closed automatically
                create_metadata(header, path)
            if not just_metadata:
                create_feather(hea_path, ecg_lead_names, path)
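
The file discovery step could also be written with `pathlib`, which avoids assembling paths by string concatenation. This is a minimal sketch under stated assumptions: `find_headers` is a hypothetical helper, and the temporary directory with empty files only stands in for a real data folder.

```python
import tempfile
from pathlib import Path


def find_headers(data_dir):
    """Return (patient_id, header_path) pairs for every .hea file in data_dir."""
    headers = sorted(Path(data_dir).glob('*.hea'), key=lambda p: p.name.upper())
    # Patient ID is taken from the first five characters of the file name
    return [(p.name[:5].upper(), p) for p in headers]


with tempfile.TemporaryDirectory() as tmp:
    # Empty placeholder files; only the names matter for this demonstration
    for name in ('s0250ECG.hea', 'S0250ECG.dat', 'S0256ECG.hea'):
        (Path(tmp) / name).touch()
    found = find_headers(tmp)
    ids = [pid for pid, _ in found]
    suffixes = [p.suffix for _, p in found]
    print(ids)
```

Sorting on the upper-cased name keeps the order stable even when patient IDs are not uniformly capitalised.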

In [63]:
### Uploading all three ECG sample types ###
# Looking at the overnight/12min walking data
dpath = droot + "\\ecgdata\\" #Change the location of the file or folder.
read_samples(dpath, ['ecg_0','ecg_1'], 'holter', True)

# Looking at the head-up-tilt data
dpath = droot + "\\labview\\converted\\head-up-tilt\\" 
read_samples(dpath, ['ecg'], 'hut', True)

# Looking at the sit-to-stand data
dpath = droot + "\\labview\\converted\\sit-to-stand\\" 
read_samples(dpath, ['ecg'], 'sts', True)


Reading data for subject S0250
Reading data for subject S0256
Reading data for subject S0273
Reading data for subject S0282
Reading data for subject S0283
Reading data for subject S0287
Reading data for subject S0288
Reading data for subject S0292
Reading data for subject S0296
Reading data for subject S0300
Reading data for subject S0301
Reading data for subject S0304
Reading data for subject S0308
Reading data for subject S0310
Reading data for subject S0312
Reading data for subject S0314
Reading data for subject S0315
Reading data for subject S0316
Reading data for subject S0317
Reading data for subject S0318
Reading data for subject S0326
Reading data for subject S0327
Reading data for subject S0339
Reading data for subject S0342
Reading data for subject S0349
Reading data for subject S0365
Reading data for subject S0366
Reading data for subject S0368
Reading data for subject S0372
Reading data for subject S0381
Reading data for subject S0382
Reading data for subject S0390
Reading 

In [64]:
#Checking that you can open the JSON file correctly
with open(r'D:\Molecool\Databases\Database1\S0300\sts\Meta.json') as json_file:
    data = json.load(json_file)
    print(data)

{'Length of reading': '1094400', 'Sampling rate': '1000', 'Error Flag': False, 'Error Type': 'No error'}


In [3]:
#Checking that you can open the feather file correctly
df = pd.read_feather(r'D:\Molecool\Databases\Database1\S0300\sts\ECG.ftr')
df.head()

        ecg
0 -0.068345
1 -0.066821
2 -0.066528
3 -0.064652
4 -0.062542
