# COVID-19 Testing Episodes
Written by: Branson Chen, Hannah Chung, Kinwah Fung <br>
Last modified: 20200506

## Table of Contents

<a href='#Overview'>Overview</a><br>
<a href='#Input-variables'>Input variables</a><br>
<a href='#Importing-data'>Importing data</a><br>
<a href='#Roll-up-to-testing-episode'>Roll-up to testing episode</a><br>
<a href='#Final-output'>Final output</a><br>
- <a href='#Data-definitions'>Data definitions</a><br>
- <a href='#Additional-information'>Additional information</a><br>

## Overview

- This script takes as input the file created by COVID19_processing.ipynb
    - Please specify output_flag = 1 or 2 in the previous program so the dataset contains the required variables
- Next, exclusions are applied to remove observations with resultstatus = N/X/W
- Some variables are created or assigned (e.g., interpretation_flag, covidtest)
- The roll-up from TEST RESULTS to TESTING EPISODES occurs in multiple steps:
    - TEST RESULTS are rolled up to TEST REQUESTS: for each TEST REQUEST (ordersid+fillerordernumberid), select the result by prioritizing, in order, the latest release time, clear covid results (covidtest), interpretations (interpretation_flag), and the covid result hierarchy (covidcode; P>I>N>D>C>R>blank)
    - TEST REQUESTS are rolled up to LAB ORDERS: for each LAB ORDER (ordersid), select the result by prioritizing, in order, clear covid results (covidtest), the latest release time, and the covid result hierarchy (covidcode; P>I>N>D>C>R>blank)
    - LAB ORDERS (with a patientid) are rolled up to TESTING EPISODES: for each TESTING EPISODE (patientid+observationdate), select the result by prioritizing, in order, the covid result hierarchy (covidcode; P>I>N>D>C>R>blank), the latest release time, and the smallest ordersid
- Multiple lab orders for each testing episode will be transposed and added to the final dataset (ordersid1-n)
- Multiple lab orders that contain the final covid result for each testing episode will be concatenated and added to the final dataset (final_result_ordersids)
- Records without patientids will then be appended to the final dataset
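
The last roll-up step above (LAB ORDERS to TESTING EPISODES) can be sketched on toy data as follows. This is a minimal illustration, not the notebook's own cell: all values are invented, and it keeps the top-ranked row with `drop_duplicates(keep='first')` instead of the `groupby(...).first()` used later in this notebook.

```python
import pandas as pd

# Toy lab orders; every value here is invented for illustration.
orders = pd.DataFrame({
    'patientid': ['A', 'A', 'A', 'B'],
    'observationdate': pd.to_datetime(
        ['2020-04-01', '2020-04-01', '2020-04-02', '2020-04-01']),
    'ordersid': [101, 102, 103, 104],
    'covid': ['N', 'P', 'N', 'D'],
    'observationreleasets': pd.to_datetime(
        ['2020-04-02 09:00', '2020-04-02 08:00',
         '2020-04-03 10:00', '2020-04-02 12:00']),
})

# Same hierarchy as the notebook: a lower code means a higher priority.
result_mappings = {'P': 1, 'I': 2, 'N': 3, 'D': 4, 'C': 5, 'R': 6, '': 7}
orders['covidcode'] = orders['covid'].map(result_mappings)

# Sort so the preferred row comes first within each episode,
# then keep only that row per (patientid, observationdate).
episodes = (
    orders.sort_values(
        ['patientid', 'observationdate', 'covidcode',
         'observationreleasets', 'ordersid'],
        ascending=[True, True, True, False, True])
    .drop_duplicates(['patientid', 'observationdate'], keep='first')
    .reset_index(drop=True)
)
print(episodes[['patientid', 'observationdate', 'ordersid', 'covid']])
```

For patient A on 2020-04-01, the positive order (102) is kept even though the negative order (101) was released later, because the result hierarchy is sorted before release time. Note that `groupby(...).first()` returns the first non-null value in each column, so the two approaches agree whenever the top-ranked row has no missing values.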

## Input variables

In [None]:
#input path and filename
input_path = ''
input_filename = 'output.csv'

#output filename (should be .csv file)
output_filename = 'episode.csv'

## Importing data

In [None]:
import pandas as pd
import numpy as np

In [None]:
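#read patientid as text; the np.object alias was removed in NumPy 1.24, so use the builtin object type
df_raw = pd.read_csv(input_path+input_filename, dtype={'patientid':object})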

In [None]:
#keep necessary columns
df = df_raw.copy(deep = True)
df = df[['patientid', 'ordersid', 'fillerordernumberid', 'observationdatetime', 
         'observationcode', 'observationreleasets', 'observationresultstatus', 'exclude_flag', 'covid']]
df = df.fillna('')

df['observationdatetime'] = pd.to_datetime(df['observationdatetime'])
df['observationreleasets'] = pd.to_datetime(df['observationreleasets'])
print('Number of records: ' + str(len(df)))

In [None]:
#remove observations based on observationresultstatus
status_nx = df['observationresultstatus'].isin(('N','X'))
df_clean = df[~status_nx]
print('Number of records removed with result status N/X: ' + str(status_nx.sum()))
#count exclude_flag removals among the remaining records so the totals reconcile
n_before = len(df_clean)
#copy so that later column assignments do not act on a slice of df
df_clean = df_clean[df_clean['exclude_flag'] == 'N'].copy()
print('Number of records removed with result status W (exclude_flag): ' + str(n_before - len(df_clean)))
print('Number of records remaining: ' + str(len(df_clean)))
print('Breakdown of covid variable:')
print(df_clean['covid'].value_counts())

In [None]:
#hierarchy: P > I > N > D > C > R > blank
#P = Positive, I = Indeterminate, N = Negative, D = penDing, C = Cancelled, R = Rejected/invalid, '' = blank
result_mappings = {'P':1,'I':2,'N':3,'D':4,'C':5,'R':6,'':7}

#assign S (presumptive-positive) as P (positive)
df_clean.loc[df_clean['covid'] == 'S', 'covid'] = 'P'

#convert covid result variable (from previous script) to number for hierarchy
df_clean['covidcode'] = df_clean['covid'].map(result_mappings)

###create new variables
#interpretation_flag: observations that have a covid interpretation code
df_clean['interpretation_flag'] = df_clean['observationcode'].apply(
    lambda x: 'T' if x in ('XON10842-3','XON12338-0','XON13527-7') else 'F')
#covidtest flag: observations that have a clear covid result (P, I, N, D)
df_clean['covidtest'] = df_clean['covidcode'].apply(lambda x: 'T' if x in (1,2,3,4) else 'F')

## Roll-up to testing episode

In [None]:
###ROLL UP from TEST RESULTS to TEST REQUESTS
#hierarchy: latest observationreleasets > covidtest = 'T' > interpretation_flag = 'T' > covidcode (P>I>N>D>C>R>blank)
df_clean.sort_values(['ordersid','fillerordernumberid','observationreleasets','covidtest','interpretation_flag','covidcode'],
                     ascending=[True, True, False, False, False, True], inplace=True)

df_testrequests = df_clean.groupby(['ordersid','fillerordernumberid']).first().reset_index()
print('Number of TEST REQUESTS: ' + str(len(df_testrequests)))

In [None]:
###ROLL UP from TEST REQUESTS to LAB ORDERS
#hierarchy: covidtest = 'T' > latest observationreleasets > covidcode (P>I>N>D>C>R>blank)
df_testrequests.sort_values(['ordersid','covidtest','observationreleasets','covidcode'],
                     ascending=[True, False, False, True], inplace=True)

df_laborders = df_testrequests.groupby(['ordersid']).first().reset_index()
print('Number of LAB ORDERS: ' + str(len(df_laborders)))

In [None]:
#add date versions of datetime (time set to midnight)
df_laborders['observationdate'] = df_laborders['observationdatetime'].dt.normalize()
df_laborders['observationreleasedate'] = df_laborders['observationreleasets'].dt.normalize()

#split up records without a patientid
keep_cols = ['patientid','ordersid','covid','covidcode','covidtest',
             'observationdate','observationreleasedate', 'observationreleasets']
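#copy each subset so later in-place sorts and column assignments do not act on a slice
df_laborders_nopat = df_laborders.loc[df_laborders['patientid'] == '', keep_cols].copy()
df_laborders_pat = df_laborders.loc[df_laborders['patientid'] != '', keep_cols].copy()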
print('Number of records without a patientid: ' + str(len(df_laborders_nopat)))

In [None]:
###ROLL UP from lab orders to testing episodes (distinct by patientid-observationdate)
#hierarchy: covidcode (P>I>N>D>C>R>blank) > latest observationreleasets > ordersid
df_laborders_pat.sort_values(['patientid','observationdate','covidcode','observationreleasets','ordersid'],
                             ascending=[True, True, True, False, True], inplace=True)
df_episodes = df_laborders_pat.groupby(['patientid','observationdate']).first().reset_index()
print('Number of TESTING EPISODES (with patientids): ' + str(len(df_episodes)))

In [None]:
#transpose all orders for each episode
df_pivot = df_laborders_pat[['patientid','observationdate','ordersid']].copy()
df_pivot.sort_values(['patientid','observationdate','ordersid'], inplace=True)

df_pivot['order_num'] = df_pivot.groupby(['patientid','observationdate'])['ordersid'].rank(method='first')
df_pivot['order_num'] = df_pivot['order_num'].apply(lambda x: 'ordersid' + str(int(x)))
df_pivot = pd.pivot_table(df_pivot, values='ordersid', index=['patientid','observationdate'], columns='order_num').reset_index()

#new variable that counts number of pivoted ordersids
df_pivot['numordersid'] = df_pivot.iloc[:,2:].count(axis=1)

In [None]:
#concatenate ordersids with the same FINAL covidcode for each testing episode into final_result_ordersids
df_result_ordersids = pd.merge(df_episodes[['patientid','observationdate','covidcode']],
                     df_laborders_pat[['patientid','observationdate','covidcode','observationreleasets','ordersid']],
                     on=['patientid','observationdate','covidcode'],
                     how='inner')
df_result_ordersids.sort_values(['patientid','observationdate','covidcode','observationreleasets','ordersid'],
                                ascending=[True,True,True,False,True],
                                inplace=True)
df_result_ordersids = df_result_ordersids.groupby(['patientid','observationdate','covidcode'])\
    ['ordersid'].apply(lambda x: ','.join(map(str, map(int, x)))).reset_index()
df_result_ordersids = df_result_ordersids.rename(columns={'ordersid':'final_result_ordersids'}).drop(columns='covidcode')

In [None]:
#merging testing episodes with pivoted orders and final_result_ordersids
df_final = pd.merge(df_pivot, df_episodes.drop(columns=['ordersid','covidcode','observationreleasets']),
                    on=['patientid','observationdate'], how='inner')
df_final = pd.merge(df_final, df_result_ordersids, on=['patientid','observationdate'], how='inner')

#adding the records with no patient id
df_laborders_nopat['ordersid1'] = df_laborders_nopat['ordersid']
df_laborders_nopat['final_result_ordersids'] = df_laborders_nopat['ordersid'].apply(int).apply(str)
df_laborders_nopat['numordersid'] = 1
df_final = pd.concat([df_final,
                     df_laborders_nopat.drop(columns=['ordersid','observationreleasets'])],
                     sort=False).rename(columns={'covid':'covidresult'})
df_final = df_final.drop(columns='covidcode').sort_values(['patientid','observationdate'])

print('Number of TESTING EPISODES (with records without patientids): ' + str(len(df_final)))
print('Breakdown of covid variable:')
print(df_final['covidresult'].value_counts())

## Final output

In [None]:
#FINAL RESULT TO OUTPUT
df_final.to_csv(output_filename, index=False)

### Data definitions

- patientid: PATIENTID variable from input file
- observationdate: Specimen collection date
- ordersid1-n: Transposed ORDERSIDs for this OBSERVATIONDATE
- numordersid: Total # ORDERSIDs for this OBSERVATIONDATE [specimen collection date]
- covidresult: COVID19 test result using hierarchy (Positive > Indeterminate > Negative > penDing > Cancelled > Rejected)
- covidtest: Is it a COVID test (T/F)? T if COVIDRESULT = Positive, Indeterminate, Negative, or penDing
- observationreleasedate: Observation release date
- final_result_ordersids: Comma-delimited ORDERSIDs that have the same final COVIDRESULT after rolling-up
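
To make the ordersid1-n, numordersid, and final_result_ordersids definitions concrete, here is a small sketch on invented data for one testing episode. It is an illustration under assumptions, not the notebook's own code: it numbers orders with `cumcount` and `unstack`, and finds the winning result with `transform('min')`, where the notebook uses `rank`, `pivot_table`, and a merge.

```python
import pandas as pd

# Three toy lab orders for one patient and one collection date (invented values).
orders = pd.DataFrame({
    'patientid': ['A', 'A', 'A'],
    'observationdate': pd.to_datetime(['2020-04-01'] * 3),
    'ordersid': [101, 102, 103],
    'covid': ['N', 'P', 'P'],
    'observationreleasets': pd.to_datetime(
        ['2020-04-02 09:00', '2020-04-02 08:00', '2020-04-02 11:00']),
})
keys = ['patientid', 'observationdate']

# ordersid1-n: number the orders within each episode, then spread them into columns.
orders = orders.sort_values(keys + ['ordersid'])
orders['order_num'] = 'ordersid' + (orders.groupby(keys).cumcount() + 1).astype(str)
wide = (orders.set_index(keys + ['order_num'])['ordersid']
        .unstack('order_num').reset_index())

# numordersid: count of non-missing ordersid columns per episode.
wide['numordersid'] = wide.filter(like='ordersid').count(axis=1)

# final_result_ordersids: orders sharing the episode's final result,
# listed by latest release time, then by ordersid.
result_mappings = {'P': 1, 'I': 2, 'N': 3, 'D': 4, 'C': 5, 'R': 6, '': 7}
orders['covidcode'] = orders['covid'].map(result_mappings)
best = orders.groupby(keys)['covidcode'].transform('min')
winners = orders[orders['covidcode'] == best].sort_values(
    keys + ['observationreleasets', 'ordersid'],
    ascending=[True, True, False, True])
final_ids = (winners.groupby(keys)['ordersid']
             .agg(lambda s: ','.join(map(str, s)))
             .rename('final_result_ordersids')
             .reset_index())

out = wide.merge(final_ids, on=keys)
print(out)
```

Here the episode has three orders (ordersid1 to ordersid3), the final result is Positive, and the two positive orders appear in final_result_ordersids with the most recently released one first.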

### Additional information

- If there are multiple ordersids that have the same final covidresult in the episode, they are listed in final_result_ordersids, ordered by the latest observationreleasets and then by ordersid
- The observationreleasedate variable comes from the first ordersid in final_result_ordersids (the first listed has the latest observationreleasets)
- We suggest using the first ordersid in final_result_ordersids to retrieve any additional order-level information from the original data source
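
One way to follow the suggestion above is sketched below. It assumes final_result_ordersids is held as a string (for example, by reading the output file with `dtype=str` for that column). The `order_info` table and its `lab_name` column are hypothetical stand-ins for the original data source.

```python
import pandas as pd

# Toy episode-level output (invented values).
episodes = pd.DataFrame({
    'patientid': ['A', 'B'],
    'final_result_ordersids': ['103,102', '104'],
})

# Hypothetical order-level table; 'lab_name' is an invented column.
order_info = pd.DataFrame({
    'ordersid': [102, 103, 104],
    'lab_name': ['Lab X', 'Lab Y', 'Lab Z'],
})

# Take the first ordersid listed for each episode and convert it to an integer key.
episodes['first_ordersid'] = (
    episodes['final_result_ordersids'].str.split(',').str[0].astype(int))

# Left join so every episode is kept even if no order-level match is found.
linked = episodes.merge(order_info, left_on='first_ordersid',
                        right_on='ordersid', how='left')
print(linked[['patientid', 'first_ordersid', 'lab_name']])
```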