The purpose of this Jupyter notebook is to walk through converting data from the form in which the Bernoulli system outputs it (30-minute XML files per bed, with 2-second parent elements) into flat files keyed by patient, bed, and date. These three attributes were chosen to prevent collisions: this program is designed to run in parallel, and I didn't want two workers editing the same file at the same time.

This is the first program that is run to process the data. At this point the assumption is that the data is in uncompressed XML files, which are in folders by device, which are in folders by day, all under one parent folder.

That structure looks something like:

Hospital 2019
    Date
        Device
            XML

The inputs to this program are those XML files; the outputs are flat files.
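
As a minimal sketch of how the folder layout above maps onto output files, the snippet below splits a path into its date and device parts and builds a `Kind^patient^device^date.tsv` name like the ones parseFile writes. The helper names (`describe_path`, `output_name`), the example path, and the patient id are hypothetical and used only for illustration.

```python
def describe_path(file_path):
    """Split .../<date>/<device>/<file> into its date, device, and file parts."""
    parts = file_path.split("/")
    return {"date": parts[-3], "device": parts[-2], "file": parts[-1]}

def output_name(kind, mrn, device, date):
    # Mirrors the Kind^patient^device^date.tsv naming used by parseFile.
    return "^".join([kind, str(mrn), str(device), str(date)]) + ".tsv"

# Hypothetical path and patient id, for illustration only.
info = describe_path("/data/Hospital2019/2019-05-04/BED01/block_0001.xml")
alarm_file = output_name("Alarm", "12345", info["device"], info["date"])
print(info)
print(alarm_file)
```

Because the patient, device, and date all appear in the name, two workers handling different device folders should never target the same output file.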

In [None]:
import time
import os
import sys
import shutil
import subprocess
import datetime
import multiprocessing
import xml.etree.ElementTree as ET
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
#Assumes the notebook environment predefines a `spark` session; nothing else in this notebook uses it.
spark.conf.set("spark.sql.execution.arrow.enabled","true")
sys.path.insert(1, './.local/lib/python3.9/site-packages')
sys.path.insert(1, '/users/PAS2164/lawr47/.local/bin')

parseFile expects to be passed the path to an individual file. Because this function is designed to be run in parallel, it has a few eccentricities, such as taking only one argument and hard-coding outLoc.

The input of parseFile is a path; the output is three flat files per patient in the bed on a given day: one for alarms, one for waveforms, and one for measurements.

The comments within the code explain what is going on.
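
Before reading the full function, it may help to see the same pattern on a tiny input. The sketch below assumes a simplified XML layout: the `settings`, `measurements`, and `m` tags and the `name`/`uom` attributes match what parseFile reads, but the outer tag names (`export`, `block`, `record`), the `s` tag, and all values are invented for illustration and may not match real exports.

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet; outer tag names and all values are invented.
SAMPLE = """
<export>
  <block>
    <record>
      <settings>
        <s name="patientIdPrimary-id">12345</s>
        <s name="assignedLocationBed">B7</s>
      </settings>
      <measurements>
        <m name="POLLTIME">2019-05-04T10:00:00</m>
        <m name="HR" uom="bpm">72</m>
      </measurements>
    </record>
  </block>
</export>
"""

def settings_of(record):
    """Return {name: text} for every child of each <settings> element."""
    found = {}
    for settings in record.iter("settings"):
        for s in settings:
            found[s.get("name")] = s.text
    return found

root = ET.fromstring(SAMPLE)
record = root[0][0]  # root -> first child -> first grandchild
patient = settings_of(record)

# Build one tab-separated row per measurement, skipping the POLLTIME metadata.
rows = []
for measurements in record.iter("measurements"):
    for m in measurements:
        if m.tag == "m" and m.get("name") != "POLLTIME":
            mrn = patient.get("patientIdPrimary-id", "none")
            rows.append("\t".join([mrn, m.get("name"), str(m.get("uom")), str(m.text)]))
print(rows)
```

Collecting the settings into a dictionary is a design choice of this sketch; parseFile itself assigns each setting to its own variable.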

In [3]:
def parseFile(filePath):
    #Define the place where files are going to be saved.
    outLoc="/fs/ess/scratch/PAS2164/James2019Out/"
    #From the path, determine the date and device name to name files later.
    pollloc=filePath.split("/")[-2]
    polldate=filePath.split("/")[-3]
    #Parse the XML file into a tree.
    xmlTree=ET.parse(filePath)
    root=xmlTree.getroot()
    #For each element of the tree...
    for child in root:
        #For each child of that element
        for child2 in child:
            #Set variable defaults; any setting that is populated overwrites its default below.
            mrn='none'
            mrnt='none'
            mrna='none'
            fname='none'
            lname='none'
            lca='none'
            lr='none'
            lb='none'
            for settings in child2.iter('settings'):
                for s in settings:
                    try:sets=s.get('name')
                    except:sets='none'
                    if sets=="patientIdPrimary-id":
                        try:mrn=s.text
                        except:mrn='none'
                    if sets=="patientIdPrimary-type":
                        try:mrnt=s.text
                        except:mrnt='none'
                    if sets=="patientIdPrimary-authority":
                        try:mrna=s.text
                        except:mrna='none'
                    if sets=="patientNameGiven":
                        try:fname=s.text
                        except:fname='none'
                    if sets=="patientNameFamily":
                        try:lname=s.text
                        except:lname='none'
                    if sets=="assignedLocationCareArea":
                        try:lca=s.text
                        except:lca='none'
                    if sets=="assignedLocationRoom":
                        try:lr=s.text
                        except:lr='none'
                    if sets=="assignedLocationBed":
                        try:lb=s.text
                        except:lb='none'
            #At this point we know the patient that is in the bed, so we can open the output files (files at this stage are keyed by patient, device, and date, so there is no intersection).
            #Append mode ('a') creates the file if it does not exist, so no fallback open is needed.
            writeAlarm=open(outLoc+"Alarm^"+str(mrn)+'^'+str(pollloc)+'^'+str(polldate)+'.tsv','a')
            writeMeasure=open(outLoc+"Measurement^"+str(mrn)+'^'+str(pollloc)+'^'+str(polldate)+'.tsv','a')
            writeWave=open(outLoc+"Wave^"+str(mrn)+'^'+str(pollloc)+'^'+str(polldate)+'.tsv','a')
            polltime='none'
            #Find the poll time for this element; it is part of the key written on every row.
            for measurements in child2.iter('measurements'):
                for m in measurements:
                    try:thyme=m.get('name')
                    except:thyme='none'
                    if thyme =="POLLTIME":
                        polltime=(m.text)
                        break
            #Key columns shared by every row written for this element.
            Pkaystring=str(mrn)+'\t'+str(mrnt)+'\t'+str(mrna)+'\t'+str(fname)+'\t'+str(lname)+'\t'+str(lca)+'\t'+str(lr)+'\t'+str(lb)+'\t'+str(polltime)
            #For each measurement in the measurements child element, build a string and write it to file.
            for mes in child2.iter('measurements'):
                for m in mes:
                    try:mtag=m.tag
                    except:mtag='none'
                    if mtag=='m':
                        try:mesname=m.get('name')
                        except:mesname='none'
                        if mesname not in mlist:
                            try:msite=m.get('site')
                            except:msite='none'
                            try:muom=m.get('uom')
                            except:muom='none'
                            try:mtext=m.text
                            except:mtext='none'
                            messtring=Pkaystring+'\t'+str(mesname)+'\t'+str(msite)+'\t'+str(muom)+'\t'+str(mtext)+'\n'
                            writeMeasure.write(messtring)
            #For each alarm in the alarms child element, build a string and write it to file.
            aname='none'
            a31='none'
            a32='none'
            a33='none'
            a34='none'
            a35='none'
            a36='none'
            for alarms in child2.iter('alarms'):
                for a in alarms:
                    try:aname=a.get('name')
                    except:aname='none'
                    try:a31=a.get('abnormalflags')
                    except:a31='none'
                    try:a32=a.get('inactivation-state')
                    except:a32='none'
                    try:a33=a.get('sil')
                    except:a33='none'
                    try:a34=a.get('setlow')
                    except:a34='none'
                    try:a35=a.get('sethi')
                    except:a35='none'
                    try:a36=a.get('chan-value')
                    except:a36='none'
                    alarmstring=Pkaystring+'\t'+str(aname)+'\t'+str(a31)+'\t'+str(a32)+'\t'+str(a33)+'\t'+str(a34)+'\t'+str(a35)+'\t'+str(a36)+'\n'
                    writeAlarm.write(alarmstring)
            for WFL in child2.iter('measurements'):
                #Iterate over the children of this measurements element.
                for mg in WFL:
                    #For each waveform (mg element), build a string and write it to file.
                    if mg.tag=='mg':
                        mgname=mg.get('name')
                        mgGain='none'
                        mgHZ='none'
                        mgwave='none'
                        mguom='none'
                        mgsite='none'
                        mgscale='none'
                        mginvalid='none'
                        mgmissing='none'
                        mgPoints='none'
                        mgPointsBytes='none'
                        mgMin='none'
                        mgMax='none'
                        mgOffset='none'
                        for mgw in mg:
                            try:mgwn=(mgw.get('name'))
                            except:mgwn='none'
                            if mgwn=='Gain':
                                try:mgGain=mgw.text
                                except:mgGain='none'
                            if mgwn=='HZ':
                                try:mgHZ=mgw.text
                                except:mgHZ='none'
                            if mgwn=='Wave':
                                try:mgwave=mgw.text
                                except:mgwave='none'
                                try:mguom=mgw.get('uom')
                                except:mguom='none'
                                try:mgsite=mgw.get('site')
                                except:mgsite='none'
                                try:mgscale=mgw.get('scale')
                                except:mgscale='none'
                                try:mginvalid=mgw.get('invalid')
                                except:mginvalid='none'
                                try:mgmissing=mgw.get('missing')
                                except:mgmissing='none'
                            if mgwn=='Points':
                                try:mgPoints=mgw.text
                                except:mgPoints='none'
                            if mgwn=='PointsBytes':
                                try:mgPointsBytes=mgw.text
                                except:mgPointsBytes='none'
                            if mgwn=='Min':
                                try:mgMin=mgw.text
                                except:mgMin='none'
                            if mgwn=='Max':
                                try:mgMax=mgw.text
                                except:mgMax='none'
                            if mgwn=='Offset':
                                try:mgOffset=mgw.text
                                except:mgOffset='none'
                        wavestring=Pkaystring+'\t'+str(mgname)+'\t'+str(mgGain)+'\t'+str(mgHZ)+'\t'+str(mgwave)+'\t'+str(mguom)+'\t'+str(mgsite)+'\t'+str(mgscale)+'\t'+str(mginvalid)+'\t'+str(mgmissing)+'\t'+str(mgPoints)+'\t'+str(mgPointsBytes)+'\t'+str(mgMin)+'\t'+str(mgMax)+'\t'+str(mgOffset)+'\n'
                        writeWave.write(wavestring)
            #Close this element's files before the next iteration reopens them, so no handles are leaked.
            writeAlarm.close()
            writeMeasure.close()
            writeWave.close()

The parseDevice function is responsible for identifying what work has already been done, determining what work to do next, and reserving that work so that other programs don't also do it.

The input path is a device folder. 

The output is a pool of processes that convert the XML files in that folder to flat files.

This is not especially efficient, because a pool opens and closes for each device folder; however, because multiple processes are all doing the same type of work at the same time, this was necessary to avoid intersection while also minimizing the time each process needs to determine what work it should do next.
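
The reservation step can be sketched on its own as follows. This is a simplified illustration rather than the code in the cell below: the helper name `try_reserve` and the example paths are hypothetical, and the check-then-append is not atomic, so two processes that read the list at the same instant could still both claim the same folder.

```python
import os
import tempfile

def try_reserve(folder_path, master_file):
    """Return True and record folder_path if it was not already listed."""
    started = set()
    if os.path.exists(master_file):
        with open(master_file) as fh:
            started = {line.rstrip("\n") for line in fh}
    if folder_path in started:
        return False
    # Record the claim so later callers skip this folder.
    with open(master_file, "a") as fh:
        fh.write(folder_path + "\n")
    return True

with tempfile.TemporaryDirectory() as tmp:
    master = os.path.join(tmp, "masterFile.txt")
    first = try_reserve("/data/2019-05-04/BED01/", master)
    second = try_reserve("/data/2019-05-04/BED01/", master)
print(first, second)
```

The first call claims the folder and the second call sees the existing entry and declines, which is the behavior the started list is meant to provide.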

In [None]:
def parseDevice(folderPath):
    global outLoc
    #define paths, masterFile.txt is a list of device paths that have been started, masterFinish is a list of device paths that have been completed.
    masterFile="/fs/ess/scratch/PAS2164/James2019Out/masterFile.txt"
    masterFinish="/fs/ess/scratch/PAS2164/James2019Out/masterFinish.txt"
    outLoc="/fs/ess/scratch/PAS2164/James2019Out/"
    pathList=[]
    #make a list of each process that has been started (an absent masterFile means nothing has been started yet).
    if os.path.exists(masterFile):
        with open(masterFile) as writeMaster:
            for l in writeMaster:
                pathList.append(l)
    #if a process has already been started, skip to the next process.
    newFolder=1
    if (folderPath+'\n') in pathList:
        newFolder=0
    if newFolder==1:
        #If the process is a new process, and should be done (device paths containing CSG_GE are ignored because they are an aggregate dataset)
        if len(folderPath.split("CSG_GE"))==1:
            #write in masterFile that the selected process is started.
            writeMaster=open(masterFile,'a')
            writeMaster.write((folderPath+'\n'))
            writeMaster.close()
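            #Open a pool of workers
            pool = multiprocessing.Pool()
            results=[]
            #for each file in the device folder
            for file in os.listdir(folderPath):
                #Assign a worker to process that file.
                filePath=(folderPath+file)
                #apply_async returns immediately; an exception raised in the worker only surfaces when .get() is called.
                results.append((filePath,pool.apply_async(parseFile, args=(filePath,))))
            #once all files are submitted, close the pool and wait for the workers to finish.
            pool.close()
            pool.join()
            #report any file whose worker raised an exception.
            for filePath,result in results:
                try:
                    result.get()
                except Exception:
                    print(filePath,outLoc,"parseFile Fail")
            #write that the job is finished in the masterFinish.txt file.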
            masterFin=open(masterFinish,'a')
            masterFin.write((folderPath+'\n'))
            masterFin.close()

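mlist is an artifact of when I first started processing and was exploring which types of XML elements existed. parseFile still uses it to skip these metadata names when writing measurement rows, so this cell must be run before parseFile is called.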

In [4]:
mlist=[
    'blockSQL',
    'deviceTime',
    'MODE',
    'POLLTIME',
    'sessionID',
    'tag',
    'TZ',
    'TZ_offset',
    'startDateTime',
    'DeviceTime',
    'none',
    'blockSQN',
]

outLoc="/fs/ess/scratch/PAS2164/Test/"
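
As a small illustration of how a skip list like mlist is applied, the sketch below filters a list of element names. The names `HR` and `SpO2` are hypothetical measurement names, and `SKIP` holds only a subset of the entries above.

```python
# Subset of the metadata names in mlist, for illustration only.
SKIP = {"POLLTIME", "TZ", "sessionID", "none"}

# Hypothetical element names as they might appear in one measurements block.
names = ["HR", "POLLTIME", "SpO2", "TZ"]

# Keep only names that are not in the skip list, preserving their order.
kept = [n for n in names if n not in SKIP]
print(kept)
```

A set is used here because membership tests on a set do not slow down as the skip list grows; with about a dozen entries, a list behaves the same in practice.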

Finally, this cell contains the parent loop, which walks through each date and each device and passes each device folder to parseDevice.

In [None]:
targetFolder="/fs/ess/scratch/PAS2164/James2019/"
for date in os.listdir(targetFolder):
    datePath=targetFolder+date+"/"
    print(datePath)
    for device in os.listdir(datePath):
        folderPath=datePath+device+"/"
        try:parseDevice(folderPath)
        except:print(folderPath+" Failed")

/fs/ess/scratch/PAS2164/James2019/2019-09-13/
/fs/ess/scratch/PAS2164/James2019/2019-05-24/
/fs/ess/scratch/PAS2164/James2019/2019-11-05/
/fs/ess/scratch/PAS2164/James2019/2019-05-30/
/fs/ess/scratch/PAS2164/James2019/2019-07-14/
/fs/ess/scratch/PAS2164/James2019/2019-05-07/
/fs/ess/scratch/PAS2164/James2019/2019-07-06/
/fs/ess/scratch/PAS2164/James2019/2019-12-14/
/fs/ess/scratch/PAS2164/James2019/2019-10-02/
/fs/ess/scratch/PAS2164/James2019/2019-07-26/
/fs/ess/scratch/PAS2164/James2019/2019-10-21/
/fs/ess/scratch/PAS2164/James2019/2019-06-11/
/fs/ess/scratch/PAS2164/James2019/2019-09-08/
/fs/ess/scratch/PAS2164/James2019/2019-08-01/
/fs/ess/scratch/PAS2164/James2019/2019-05-23/
/fs/ess/scratch/PAS2164/James2019/2019-10-28/
/fs/ess/scratch/PAS2164/James2019/2019-08-13/
/fs/ess/scratch/PAS2164/James2019/2019-05-04/
/fs/ess/scratch/PAS2164/James2019/2019-06-12/
/fs/ess/scratch/PAS2164/James2019/2019-12-24/
/fs/ess/scratch/PAS2164/James2019/2019-12-07/
/fs/ess/scratch/PAS2164/James2019/

At the end of this step, repairReset is run to set masterFile equal to masterFinish, so that this program can be rerun to repair incomplete devices.
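
repairReset itself is not shown in this notebook. A minimal sketch of the idea, assuming it simply overwrites the started list with the finished list, might look like the following; the function name `repair_reset` and the example paths are hypothetical.

```python
import os
import shutil
import tempfile

def repair_reset(master_file, master_finish):
    """Overwrite the started list with the finished list."""
    shutil.copyfile(master_finish, master_file)

with tempfile.TemporaryDirectory() as tmp:
    master_file = os.path.join(tmp, "masterFile.txt")
    master_finish = os.path.join(tmp, "masterFinish.txt")
    with open(master_file, "w") as fh:
        fh.write("/d/BED01/\n/d/BED02/\n")  # BED02 was started but never finished
    with open(master_finish, "w") as fh:
        fh.write("/d/BED01/\n")
    repair_reset(master_file, master_finish)
    with open(master_file) as fh:
        remaining = fh.read().splitlines()
print(remaining)
```

After the reset, a device that was started but never finished no longer appears in the started list, so the next run picks it up again. Because parseFile opens its outputs in append mode, rows already written for such a device would be appended a second time unless its partial output files are removed first.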