
# #QuickGrab 

#### For cases where I open the file for the ONLY purpose of copying important code.

*This markdown cell and the following code cell were added after the rest of this notebook was finished.*


In [1]:
# Make autocomplete a little bit quicker
%config Completer.use_jedi = False
from pj_funcs import *

##########################################################################

# Import past few days' PHESS data
file = pd.read_csv('../data/raw/export.csv', encoding = 'Cp1252')

# Import NSSP_Priority Elements csv file to be used in my custom ADT-parsing function
reader = pd.read_csv('../data/processed/column_guide_with_key.csv')

##########################################################################

# Implement my custom ADT-parsing function
df = NSSP_Element_Grabber(file,Timed=True)

# Run the completeness function!
comp = completeness_facvisits(df, Timed=True)


84.8871603012085


  empty.loc[facility,:] = (countz/visit_count)*100


Time Elapsed:   2.612 seconds



# Group By Visit


### The File
export.csv - contains 1317 observations sampled from PHESS over a span of 2 days.  All columns from the original data table are included.

### Goal
<u>Create method to organize the data into distinct visits</u>

The background behind this is that one patient can be admitted (ADT-A01), have their record updated 30 times (ADT-A08), and be discharged (ADT-A03).  While they only visited the hospital once, they have 32 different records.  These need to be merged for a more comprehensive dataset.

### Strategies

I strongly suspect the fastest way to do this is with pandas' `DataFrame.groupby()` method.  A wonderful explanation with examples can be found [here](https://realpython.com/pandas-groupby/).
<ol>
    <li> Begin by getting a feel for the data.  Are there duplicates?  Which fields are very complete vs. very incomplete?  How fast can we loop through all the rows? </li>
    <li> Test candidate columns for uniqueness. </li>
    <li> Once an identifying column (or set of columns) is found, group by it and look at each subgroup.
        <ul>
            <li> Cut each group by admit messages (A01) </li>
            <li> Cut each group by arbitrary length of stay </li>
        </ul> </li>
</ol>
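
A minimal sketch of the first two steps, assuming a small toy table in place of the real export. The values in `toy` are made up for illustration; only the column names come from the PHESS file. It reports how complete each column is and how many distinct values it holds, which is a quick way to spot candidate identifying columns.

```python
import pandas as pd

# Toy stand-in for the PHESS export (hypothetical values, not real data)
toy = pd.DataFrame({
    "MSG_CONTROL_ID": ["m1", "m2", "m3", "m4"],
    "PATIENT_MRN": ["100", "100", "200", "300"],
    "TRIAGE_NOTES": [None, None, "x", None],
})

# One row per column: percent of non-null entries and number of distinct values
profile = pd.DataFrame({
    "pct_complete": toy.notna().mean() * 100,
    "n_unique": toy.nunique(),
})

print(profile)
```

A column whose `n_unique` equals the row count identifies individual messages; a column that repeats across rows (like `PATIENT_MRN` here) is a candidate grouping key.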


# After steps 1,2,3:

### Interesting Notes:

<ul>
    <li> <b>Duplicates: </b> There are no exact duplicate rows within our PHESS sample:

`len(file) - len(file.drop_duplicates())` gives $29944-29944=0$.

However, there are some exact duplicate HL7 messages:

`len(file) - len(file['MESSAGE'].drop_duplicates())` gives $29944-29734=210$. </li> </ul>
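
The two counts above can also be taken with `duplicated()`, which flags every repeat after the first occurrence. This is a sketch on a made-up toy frame (the values are hypothetical; only the column names come from the export), not the real data.

```python
import pandas as pd

# Toy stand-in for the PHESS export (hypothetical values, not real data)
toy = pd.DataFrame({
    "MESSAGE": ["MSH|a", "MSH|a", "MSH|b", "MSH|c"],
    "SYSTEM_UPDATE_COUNT": [0, 1, 0, 0],
})

# Fully identical rows: every column must match
n_dup_rows = int(toy.duplicated().sum())

# Repeated HL7 messages: only the MESSAGE column must match
n_dup_messages = int(toy["MESSAGE"].duplicated().sum())

print(n_dup_rows, n_dup_messages)
```

Both numbers equal the `len(...) - len(...drop_duplicates())` differences computed the long way.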


--------------------------

    
More insight on message duplicates: let row1 and row2 be two different rows of the dataset. In every case where

`(row1['MESSAGE'] == row2['MESSAGE']) & (row1 != row2)`

holds, only a handful of columns disagree. The disagreeing columns always come from the following set:

['CHIEF_COMPLAINT_UID',
 'SYSTEM_UPDATE_COUNT',
 'SYSTEM_DATE_ADDED',
 'SYSTEM_DATE_UPDATED']
 
 
Put simply, when a MESSAGE is repeated in our dataset, the rows that share it are nearly identical: between 1 and 4 columns (never 0) differ, and those columns always come from the list above.
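
One way such a check could be run is sketched below. It is not necessarily how the list above was produced. The toy values are hypothetical; only the column names are taken from the export.

```python
import pandas as pd

# Toy stand-in for the PHESS export (hypothetical values, not real data)
toy = pd.DataFrame({
    "MESSAGE": ["MSH|a", "MSH|a", "MSH|b", "MSH|c", "MSH|c"],
    "PATIENT_MRN": ["1", "1", "2", "3", "3"],
    "SYSTEM_UPDATE_COUNT": [0, 1, 0, 0, 0],
    "SYSTEM_DATE_UPDATED": ["d1", "d2", "d1", "d3", "d4"],
})

# Keep only rows whose MESSAGE appears more than once
repeated = toy[toy.duplicated("MESSAGE", keep=False)]

# For each repeated MESSAGE, count distinct values per column
# (dropna=False so that NaN vs. a real value also counts as a disagreement)
distinct_per_group = repeated.groupby("MESSAGE").nunique(dropna=False)

# A column "disagrees" if any group holds more than one distinct value
disagrees = (distinct_per_group > 1).any(axis=0)
disagreeing_columns = sorted(disagrees[disagrees].index)

print(disagreeing_columns)
```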


# After Research:

I found that:
<ol>
    <li> A patient within a facility will almost always have a Patient MRN </li>
    <li> If a patient visits the same facility more than once, the MRN will stay constant BUT the visit number will be different </li>
</ol>

![Patient%20MRN%20and%20Visit%20Number.png](attachment:Patient%20MRN%20and%20Visit%20Number.png)



#  Methodology

Therefore, if we group by facility, then by Patient MRN, then by visit number, each resulting group is one distinct visit, which accounts for the fact that a patient might visit the same facility on separate occasions.
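
A minimal sketch of that grouping on a toy frame. The rows are hypothetical; the three key columns and `EXTRACT_MSG_TYPE` are column names from the export.

```python
import pandas as pd

# Toy stand-in: patient 100 visits facility A twice and facility B once
toy = pd.DataFrame({
    "FACILITY_NAME": ["A", "A", "A", "A", "B"],
    "PATIENT_MRN": ["100", "100", "100", "200", "100"],
    "PATIENT_VISIT_NUMBER": ["v1", "v1", "v2", "v3", "v9"],
    "EXTRACT_MSG_TYPE": ["A01", "A08", "A01", "A01", "A01"],
})

# Each distinct (facility, MRN, visit number) combination is one visit
visit_keys = ["FACILITY_NAME", "PATIENT_MRN", "PATIENT_VISIT_NUMBER"]
visits = toy.groupby(visit_keys, sort=False)

n_visits = visits.ngroups          # number of distinct visits
messages_per_visit = visits.size() # number of ADT rows behind each visit

print(n_visits)
print(messages_per_visit)
```

Note that `groupby` drops rows with a missing key by default, whereas joining the keys as strings (as the function further down does) keeps them under the label `'nan'`.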


# Implementation!

In [1]:
# Make autocomplete a little bit quicker
%config Completer.use_jedi = False
    
# Import[ant] libraries
import hl7
import pandas as pd
import numpy as np
import re
import os
import math
import matplotlib.pyplot as plt
from pj_funcs import *
import time

In [6]:
# Import past few days' PHESS data
file = pd.read_csv('../data/raw/export.csv', encoding = 'Cp1252')

# Import NSSP_Priority Elements csv file to be used in my custom ADT-parsing function
reader = pd.read_csv('../data/processed/column_guide_with_key.csv')


In [7]:
# Implement my custom ADT-parsing function

f = NSSP_Element_Grabber(reader,file,Timed=True)

Time Elapsed:   104.402  seconds


In [6]:
# Run the function!!!
comp = completeness_facvisits(f, Timed=True)

  empty.loc[facility,:] = (countz/visit_count)*100


Time Elapsed:   3.283 seconds


In [7]:
comp.head()

| Facility | Num_Visits | C_Unique_Patient_ID | Visit_ID | Facility_Type_Code | Sending_Facility_ID | Site_ID | C_Facility_ID | Treating_Facility_ID | Admit_Date_Time | Patient_Class_Code | Patient_Zip | ... | C_Patient_Age_Data_Source | C_Patient_Age_Units_Data_Source | C_Patient_Age_Years_Data_Source | C_Patient_County_Data_Source | C_Chief_Complaint_Data_Source | C_Visit_Date_Data_Source | MESSAGE | FACILITY_NAME | PATIENT_VISIT_NUMBER | PATIENT_MRN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Eskenazi | 63 | 100 | 100 | 95.2381 | 100 | 0 | 0 | 100 | 100 | 0 | 92.0635 | ... | 100 | 100 | 100 | 90.4762 | 100.0 | 100 | 100 | 100 | 100 | 100 |
| Reid Health | 22 | 100 | 100 | 100.0 | 100 | 0 | 0 | 100 | 100 | 100 | 100.0 | ... | 100 | 100 | 100 | 100.0 | 95.4545 | 100 | 100 | 100 | 100 | 100 |
| Howard Regional | 14 | 100 | 100 | 100.0 | 100 | 0 | 0 | 100 | 100 | 100 | 100.0 | ... | 100 | 100 | 100 | 78.5714 | 100.0 | 100 | 100 | 100 | 100 | 100 |
| Community East | 49 | 100 | 100 | 100.0 | 100 | 0 | 0 | 100 | 100 | 100 | 97.9592 | ... | 100 | 100 | 100 | 83.6735 | 100.0 | 100 | 100 | 100 | 100 | 100 |
| Goshen General | 12 | 100 | 100 | 100.0 | 100 | 0 | 0 | 100 | 100 | 100 | 100.0 | ... | 100 | 100 | 100 | 100.0 | 100.0 | 100 | 100 | 100 | 100 | 100 |


In [57]:
file.columns

Index(['CHIEF_COMPLAINT_UID', 'MSG_SENDING_FACILITY', 'PATIENT_CARE_LOCATION',
       'ADMIT_DATETIME', 'MSG_SENDING_APPLICATION', 'UNIVERSAL_ID',
       'UNIVERSAL_ID_TYPE', 'MSG_RECEIVING_FACILITY', 'MSG_DATETIME',
       'EXTRACT_MSG_TYPE', 'MSG_CONTROL_ID', 'VERSION_ID', 'PATIENT_MRN',
       'PATIENT_FIRSTNAME', 'PATIENT_LASTNAME', 'PATIENT_MIDDLENAME',
       'PATIENT_SUFFIX', 'PATIENT_PREFIX', 'PATIENT_BIRTH_DATETIME',
       'PATIENT_SEX', 'PATIENT_RACE', 'PATIENT_ADDRESS', 'PATIENT_CITY',
       'PATIENT_STATE', 'PATIENT_ZIP', 'PATIENT_COUNTRY', 'PATIENT_COUNTY',
       'PATIENT_COUNTY_CODE', 'PATIENT_PHONE_NUMBER', 'PATIENT_ETHNICITY',
       'PATIENT_CLASS', 'PATIENT_VISIT_NUMBER', 'DISCHARGE_DISPOSITION',
       'DISCHARGE_DATETIME', 'ADMIT_REASON', 'MODE_OF_TRANSPORTATION',
       'CHIEF_COMPLAINT_TEXT', 'INITIAL_TEMPERATURE', 'INITIAL_PULSE_OXIMETRY',
       'TRIAGE_NOTES', 'DATE_OF_ONSET', 'PRELIMINARY_DIAGNOSIS',
       'SYSTEM_DATE_ADDED', 'SYSTEM_DATE_UPDATED', 'SYSTE

### Functions I wrote :)

In [None]:
####################################################################################################################
# Create completeness_report function.  Works on dataframe output of my custom ADT-parsing function
####################################################################################################################



def completeness_facvisits(df, Timed = False):
    
    '''
    1. Read in the pandas DataFrame output by the NSSP_Element_Grabber() function.
    2. Group events by Facility->Patient MRN->Patient Visit Num
        to find unique visits
    3. Return a DataFrame.
        index  -> (Facility Name, Number of Visits)
        values -> Percent of visits within the facility with a
            non-null value in each column
    
    Parameters
    ----------
    df : pandas.DataFrame, required
        should have the format output by the NSSP_Element_Grabber() function.
        Note: a VISIT_INDICATOR column is added to df in place.
    Timed : bool, optional
        If True, prints the elapsed time in seconds
    
    Returns
    -------
    DataFrame
        A pandas dataframe object is returned as a two dimensional data
        structure with labeled axes.
        
    Requirements
    ------------
    *Libraries*
    -from pj_funcs import *
 
    '''

    start_time = time.time()
    
    # Make a visit indicator that combines facility|mrn|visit_num
    df['VISIT_INDICATOR'] = df[['FACILITY_NAME', 'PATIENT_MRN', 'PATIENT_VISIT_NUMBER']].astype(str).agg('|'.join, axis=1)

    # Create array of Falses.  Useful down the road 
    false_array = np.array([False] * len(df.columns))

    # Create empty dataframe we will eventually insert into
    empty = pd.DataFrame(columns=df.columns)

    # Create empty lists for facility names (facs) and number of visits per facility (num_visits)
    # These lists will serve as our output's descriptive indexes
    num_visits = []
    facs = []

    # First group our data by Facility Name.  sort=False speeds up runtime
    fac_sort = df.groupby('FACILITY_NAME',sort=False)

    # Iterate through the groupby object
    for facility, df1 in fac_sort:

        # Append facility name to empty list
        facs.append(facility)

        # Initiate visit count
        visit_count = 0

        # Group by visit indicator (facility|mrn|visit_num), so each group is one visit
        MRN_sort = df1.groupby(['VISIT_INDICATOR'],sort=False)

        # Initiate list of 0s.  Each column gets +1 for each visit with a non-null column value.
        countz = false_array.copy().astype(int)

        for visit, df3 in MRN_sort:


            # A column counts as filled for this visit if ANY of the visit's
            #       ADT rows has a non-null value in it (vectorized OR across rows).
            init = df3.notnull().any(axis=0).to_numpy()

            # Add information on null (0) vs. non-null (1) columns to countz which is initially all 0 but updates for each patient.
            countz += init.astype(int)

            # Show that the number of visits has increased
            visit_count += 1


        # Append visit number to empty list
        num_visits.append(visit_count)

        # Update empty dataframe with information on completeness (out of 100%) we had for each column
        # * note countz is a 1D array that counts how many visits have non-null values in each column.
        empty.loc[facility,:] = (countz/visit_count)*100


    # Clarify and Create index information for output Dataframe
    empty['Num_Visits'] = num_visits
    empty['Facility'] = facs
    empty = empty.set_index(['Facility','Num_Visits'])
    # Keep track of end time
    end_time = time.time()
    
    # If the user asks for the elapsed time, print it in seconds
    if Timed:
        print('Time Elapsed:   '+str(round((end_time-start_time),3))+' seconds')
    
    # Return filled dataframe.
    return empty
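
For comparison, here is a hedged sketch of how the same per-facility percentages could be computed without explicit Python loops. `completeness_sketch` is a hypothetical name, not part of `pj_funcs`, and it has only been checked against the toy frame below, not against the real PHESS data. It returns `Num_Visits` as a column rather than as an index level.

```python
import pandas as pd

def completeness_sketch(df, keys=("FACILITY_NAME", "PATIENT_MRN", "PATIENT_VISIT_NUMBER")):
    """Percent of visits per facility with a non-null value in each column."""
    keys = list(keys)
    # One row per visit: True where ANY message of the visit has a non-null value.
    # Keys are cast to str so that missing keys form their own 'nan' group.
    filled = df.notna().groupby([df[k].astype(str) for k in keys]).any()
    # Collapse visits to facilities (first index level)
    by_facility = filled.groupby(level=0)
    pct = by_facility.mean() * 100
    pct.insert(0, "Num_Visits", by_facility.size())
    return pct

# Toy data (hypothetical values): facility A has two visits, B has one
toy = pd.DataFrame({
    "FACILITY_NAME": ["A", "A", "A", "B"],
    "PATIENT_MRN": ["1", "1", "2", "1"],
    "PATIENT_VISIT_NUMBER": ["v1", "v1", "v2", "v3"],
    "Patient_Zip": [None, "46202", None, "47374"],
})

result = completeness_sketch(toy)
print(result)
```

In the toy data, visit `A|1|v1` gets its zip code from its second message, so it counts as complete even though the first message left the field empty.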

In [None]:
#######################################
# This one isn't as much recycled as something I just came up with
#     to make my life easier

#########################################

def LIKE(array,word):
    '''
    Finds all elements of a list/array that contain a given word (substring match)
    
    Parameters
    ----------
    array : list/array type, required
    word : str, required
    
    Returns
    -------
    np.array
        An array which is a subset of the original containing the word
        
    Requirements
    ------------
    -import numpy as np
    
    '''
    # Convert to numpy array.  Everything's easier with numpy
    array = np.array(array)
    
    # Create in-condition.  One True/False per element (dtype=bool keeps empty inputs indexable)
    cond = np.array([str(word) in item for item in array], dtype=bool)
    
    # Enact that condition 
    subset = array[cond]
    
    # Return the subset
    return subset
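
A short usage sketch: the same substring filter written with `np.char.find`, applied to a few column names from the export. `like_sketch` is a hypothetical stand-in so the example runs on its own; it is not the function in `pj_funcs`.

```python
import numpy as np

def like_sketch(values, word):
    """Return the elements of `values` that contain `word` as a substring."""
    values = np.asarray(values, dtype=str)
    # np.char.find returns -1 where the substring is absent
    mask = np.char.find(values, str(word)) >= 0
    return values[mask]

columns = ["PATIENT_MRN", "MESSAGE", "PATIENT_ZIP", "ADMIT_DATETIME"]
patient_columns = like_sketch(columns, "PATIENT")

print(patient_columns)
```

This is handy for pulling every column of one family (for example, all `PATIENT_*` fields) out of a long column list.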