## PSV to DF

This notebook loads each PSV file from the PhysioNet database and concatenates all the records into a single dataframe. Note that a new patient ID column is created, based on the filename. This lets downstream analysis distinguish between subjects.
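As a rough illustration of how a patient ID can be derived from a filename, the sketch below strips the folder and the extension with `os.path`. The helper name `patient_id_from_path` and the example filename are hypothetical, chosen only for illustration.

```python
import os


def patient_id_from_path(path):
    """Return the filename without its folder or extension (hypothetical helper)."""
    filename = os.path.basename(path)       # e.g. 'p000001.psv'
    return os.path.splitext(filename)[0]    # e.g. 'p000001'


# Illustrative filename; the real files may be named differently.
print(patient_id_from_path('training_setA/p000001.psv'))  # -> p000001
```

Using `os.path` rather than splitting on `'/'` is one way to keep the extraction independent of the path separator used by the operating system.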

The expected layout is two folders, "training_setA" and "training_setB", placed next to this notebook, each containing about 20,000 PSV files. Running this notebook is optional, as the final dataframe has been uploaded to Google Drive and is downloaded automatically in the feature_engineering notebook.
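A quick way to confirm the layout before running the notebook is to count the PSV files in each folder. This is a minimal sketch; the helper `count_psv_files` is hypothetical and assumes the two folders sit in the directory passed as `root`.

```python
import glob
import os


def count_psv_files(root='.', folders=('training_setA', 'training_setB')):
    """Count the .psv files in each expected folder (hypothetical helper)."""
    counts = {}
    for folder in folders:
        pattern = os.path.join(root, folder, '*.psv')
        # glob returns an empty list if the folder is missing or empty
        counts[folder] = len(glob.glob(pattern))
    return counts


# A count of 0 suggests the folder is missing or in the wrong place.
print(count_psv_files())
```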

Import libraries

In [1]:
import pandas as pd
import numpy as np
import glob

In [3]:
#get a list of all the files
files1 = glob.glob('training_setA/*.psv')
files2 = glob.glob('training_setB/*.psv')
#glob returns files in arbitrary order, so sort to make the row order reproducible
files = np.sort(np.concatenate((files1, files2)))

df_list = []
for ind, f in enumerate(files):
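    #the patient ID is the filename without its folder or extension
    #(backslashes are normalised so this also works with Windows paths)
    patient_id = f.replace('\\', '/').split('/')[-1].split('.')[0]
    #PSV files are pipe-separated
    df = pd.read_csv(f, sep='|')
    df = df.assign(patient=patient_id)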

    #redefine the labels to be 1 only for t >= t_sepsis
    #in other words, a label of 1 now means that sepsis has already occurred at this time step
    #the original labels turn positive six hours before onset, so in practice
    #this means setting the first six 1 labels to 0
    df.loc[df[df['SepsisLabel'] == 1].head(6).index.values, 'SepsisLabel'] = 0
    
    #print a status update
    if ind % 1000 == 0:
        print(ind)
    
    #append the current parsed file to the list 
    df_list.append(df)


#combine all the parsed files into one dataframe and save it as a pickle file
df = pd.concat(df_list)
df = df.reset_index(drop=True)
df.to_pickle('combined.pkl')


0
1000
2000
3000
4000
5000
6000
7000
8000
9000
10000
11000
12000
13000
14000
15000
16000
17000
18000
19000
20000
21000
22000
23000
24000
25000
26000
27000
28000
29000
30000
31000
32000
33000
34000
35000
36000
37000
38000
39000
40000
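To make the relabelling described in the code comments concrete, here is a small worked example on a made-up record. It is only a sketch: the data are synthetic, the function name `shift_sepsis_labels` is hypothetical, and it assumes one row per hour with the positive labels starting six rows before onset.

```python
import io

import pandas as pd

# Synthetic pipe-separated record: 10 hourly rows, labels turn positive at row 2.
psv_text = "HR|SepsisLabel\n" + "\n".join(
    f"{80 + i}|{int(i >= 2)}" for i in range(10)
)
toy = pd.read_csv(io.StringIO(psv_text), sep='|')


def shift_sepsis_labels(df, n_hours=6):
    """Set the first n_hours positive labels to 0 (hypothetical helper)."""
    df = df.copy()  # leave the input dataframe untouched
    first_positive = df.index[df['SepsisLabel'] == 1][:n_hours]
    df.loc[first_positive, 'SepsisLabel'] = 0
    return df


shifted = shift_sepsis_labels(toy)
print(toy['SepsisLabel'].tolist())      # -> [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
print(shifted['SepsisLabel'].tolist())  # -> [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
```

After the shift, only the last two rows remain positive, so a label of 1 marks the time steps at or after onset rather than the six hours leading up to it.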
