# COVID-19 Variant of Concern (VOC) Lab Results
Written by: Branson Chen, Sina Brar <br> 
Last modified: 20210317

## Table of Contents

<a href='#Overview'>Overview</a><br>
<a href='#Input-variables'>Input variables</a><br>
<a href='#Importing-data'>Importing data</a><br>
<a href='#Text-analysis'>Text analysis</a><br>

- <a href='#Algorithm-description'>Algorithm description</a><br>
- <a href='#Initial-processing'>Initial processing</a><br>
- <a href='#Assign-results'>Assign results</a><br>

<a href='#Final-output'>Final output</a><br>
<a href='#Roll-Up'>Roll Up</a><br>
<a href='#Manual-review'>Manual review</a><br>
<a href='#Testing-and-validation'>Testing and validation</a><br>

## Overview

- This script first imports a SAS dataset based on the input variables provided, and then fields are decoded/renamed.
- <strong> The SAS dataset is created by taking any records under two TR Codes: TR12952-8 (VOC screening), TR12953-6 (VOC sequencing) and any records that contain the strings: '(VOC', ' VOC', or both 'VARIANT' and 'SARS' </strong>
- Next, the text is cleaned (clean function) and then tokenized (tokenize function).
- Relevant labels are then assigned to the tokens (assign_labels function).
- The labelled tokens are then interpreted using an in-house algorithm (interpret function).
- All of the information from the previous step is then collapsed to give one result per virus per test (process_result function), and unidentified virus/test types are filled in based on observation codes and testrequest codes.
- The results are converted to a single character per test type (char_output function) and then written to a CSV file.


- The second part of the script collapses the final output from the previous section into TESTING EPISODES (unique patientid+observationdate). 
- First, exclusions are applied to remove observations with resultstatus = N/X/W and observations with missing patientid.
- Records with no results are removed and they are identified by the temporary scr_flag and seq_flag columns.
- Flags (scr_test, seq_test) are created for each test type to identify records where there is at least one clear result (P, N, I, D).
- Observations (with a patientid) are rolled-up to TESTING EPISODES: for each TESTING EPISODE and result column, select the result by prioritizing clear results (P>N>I>D) with the latest release timestamp and then any result with the latest release timestamp.
- For each test type, an observation release date (observationreleasedate_scr, observationreleasedate_seq) is specified by taking the latest release date with a clear result.
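
The roll-up rule above can be read in more than one way. The sketch below is a minimal illustration of one reading, not the script's actual implementation: within a testing episode, the best-ranked clear result wins (P > N > I > D), ties are broken by the latest release timestamp, and if no clear result exists the most recently released result is taken. The dictionary keys (`result`, `release_ts`) and the non-clear code `"R"` are hypothetical names chosen for the example.

```python
from datetime import datetime

# Priority among clear results: lower rank wins (P > N > I > D).
CLEAR_RANK = {"P": 0, "N": 1, "I": 2, "D": 3}

def roll_up(records):
    """Return one result per (patientid, observationdate) testing episode."""
    episodes = {}
    for rec in records:
        key = (rec["patientid"], rec["observationdate"])
        episodes.setdefault(key, []).append(rec)

    rolled = {}
    for key, recs in episodes.items():
        # Newest release first, so ties on rank resolve to the latest record.
        newest_first = sorted(recs, key=lambda r: r["release_ts"], reverse=True)
        clear = [r for r in newest_first if r["result"] in CLEAR_RANK]
        if clear:
            best = min(clear, key=lambda r: CLEAR_RANK[r["result"]])
        else:
            # No clear result: fall back to the most recently released one.
            best = newest_first[0]
        rolled[key] = best["result"]
    return rolled

records = [
    {"patientid": "A", "observationdate": "2021-03-01", "result": "N", "release_ts": datetime(2021, 3, 3)},
    {"patientid": "A", "observationdate": "2021-03-01", "result": "P", "release_ts": datetime(2021, 3, 2)},
    {"patientid": "B", "observationdate": "2021-03-01", "result": "R", "release_ts": datetime(2021, 3, 2)},
]
print(roll_up(records))
```

Under this reading, patient A's episode resolves to `P` even though the `N` was released later, and patient B's episode keeps its only (non-clear) result.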

## Input variables

In [None]:
#input path and filename (should be .sas7bdat file)
input_path = '//'

#Change the date of input file if needed
input_filename = '.sas7bdat'

#name of patientid variable in input dataset, will be renamed as 'patientid'
input_patientid_var = 'ikn'

#output additional columns
#1 = with key columns, 2 = with ALL columns
output_flag = 1

#output filename
output_filename = 'output_voc'

## Importing data

In [None]:
import pandas as pd
import numpy as np

In [None]:
%%time
#import sas file (COMPRESS=BINARY MAY NOT WORK WITH READ_SAS; COMPRESS=YES|CHAR DOES NOT WORK WITH READ_SAS)
df_raw=pd.read_sas(input_path+input_filename)

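#decode strings (object columns hold bytes after read_sas)
#np.object was removed in NumPy 1.24; the builtin object is the equivalent dtype
df_raw.loc[:, df_raw.dtypes == object] = df_raw.loc[:, df_raw.dtypes == object].apply(lambda x: x.str.decode('UTF-8'))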
df_raw.fillna('', inplace=True)
print('# of records:',len(df_raw))

In [None]:
df = df_raw.copy(deep = True)

#rename variables
df = df.rename(columns={input_patientid_var:'patientid','fillerordernumber':'fillerordernumberid',
                       'observationvalue':'value','observationsubid':'subid'})
#keep key cols
key_cols = ['patientid', 'ordersid', 'fillerordernumberid', 
            'reportinglaborgname', 'performinglaborgname', 'observationdatetime','specimenreceiveddatetime',
            'testrequestcode', 'observationcode', 'observationreleasets', 
            'observationresultstatus', 'subid', 'value']
df = df[key_cols]

#set exclude_flag based on observationresultstatus = W
df_W = df.loc[df['observationresultstatus'] == 'W', ['ordersid', 'observationcode', 'value']]
df_excl = df[['ordersid', 'observationcode', 'value']].reset_index().merge(df_W, how='inner').set_index('index')
df['exclude_flag'] = 'N'
df.loc[df.index.isin(df_excl.index),['exclude_flag']] = 'Y'
print(df['exclude_flag'].value_counts())

#set exclude_flag based on DO NOT TRANSMIT code
# DNT_text = '<p1:MicroOrganism xmlns:p1="http://www.ssha.ca"><p1:Code>99999999999</p1:Code><p1:Text>Do Not Transmit</p1:Text><p1:CodingSystem>HL79905</p1:CodingSystem></p1:MicroOrganism>'
# df_DNT = df.loc[df['value'] == DNT_text, ['ordersid', 'observationcode','observationreleasets']]
# df_excl2 = df[['ordersid', 'observationcode', 'observationreleasets']].reset_index().merge(df_DNT, how='inner').set_index('index')
# df.loc[df.index.isin(df_excl2.index),['exclude_flag']] = 'Y'
# print(df['exclude_flag'].value_counts())

In [None]:
%%time
#determine which observations need to be concatenated
group_cols = ['ordersid', 'fillerordernumberid', 'reportinglaborgname', 
              'testrequestcode', 'observationcode', 'observationreleasets', 'observationresultstatus']
df_gp_subid = df.reset_index().groupby(group_cols).agg({'index':tuple, 'subid':tuple}).reset_index()
df_gp_subid = df_gp_subid.rename(columns={'index':'original_indexes'})

#only concatenate groups with more than two subids, where every subid is numeric and one of them is '1'
df_to_concat = df_gp_subid[df_gp_subid['subid'].apply(lambda x: all([subid.isdigit() for subid in x]) and len(x) > 2 and '1' in x)]
concat_indexes = [i for tup in df_to_concat['original_indexes'] for i in tup]

#concatenate based on subid
df_gp_concat = df[df.index.isin(concat_indexes)].reset_index()
df_gp_concat['subid'] = df_gp_concat['subid'].apply(int)
df_gp_concat = df_gp_concat.sort_values(by = group_cols+['subid']).groupby(group_cols)
df_gp_concat = df_gp_concat.agg({'index': tuple,
                   'value': lambda x: ' '.join(map(str, x))}).reset_index()

#add on records that were not concatenated
df_gp = df.loc[~df.index.isin(concat_indexes), group_cols+['value']].reset_index()
df_gp['index'] = df_gp['index'].apply(lambda x: (x,))
df_gp = pd.concat([df_gp_concat, df_gp], sort=False).rename(columns={'index':'original_indexes'})

#narrow down columns of df
df_cols = ['patientid','ordersid','fillerordernumberid','observationdatetime','specimenreceiveddatetime','testrequestcode',
           'observationcode','observationreleasets', 'observationresultstatus','exclude_flag']
df = df[df_cols]

print('# of TEST RESULTS:', len(df_gp))

#cleanup
del df_W
del df_excl
del df_gp_subid
del df_to_concat
del concat_indexes
del df_gp_concat
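
As a plain-Python illustration of the concatenation rule used in the cell above (a sketch only, not a replacement for the pandas code): a group is joined into one text value only when it has more than two sub-IDs, all of them numeric, and one of them is `'1'`. The function name `concat_by_subid` and the tuple layout are assumptions made for this example.

```python
from collections import defaultdict

def concat_by_subid(rows):
    """rows: iterable of (group_key, subid, value); subid is a string."""
    groups = defaultdict(list)
    for key, subid, value in rows:
        groups[key].append((subid, value))

    out = {}
    for key, parts in groups.items():
        subids = [s for s, _ in parts]
        if len(parts) > 2 and all(s.isdigit() for s in subids) and "1" in subids:
            # Order the fragments numerically by subid, then join with spaces.
            ordered = sorted(parts, key=lambda p: int(p[0]))
            out[key] = [" ".join(v for _, v in ordered)]
        else:
            # Leave the values as separate results.
            out[key] = [v for _, v in parts]
    return out

rows = [
    ("order1", "2", "not"), ("order1", "1", "VOC"), ("order1", "3", "detected"),
    ("order2", "1", "x"), ("order2", "2", "y"),
]
print(concat_by_subid(rows))
```

Here `order1` becomes the single value `'VOC not detected'`, while `order2` has only two sub-IDs and is left as two separate values.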

## Text analysis

In [None]:
import nltk
import re

In [None]:
#clean punctuation, xml field, numbers, other text
puncs = [';', ':', ',', '.', '-', '_', '/', '(', ')', '[', ']', '{', '}', '<', '>', '*', '#', '?', '.', '+', 
        'br\\', '\\br', '\\e\\', '\\f\\', '\\t\\', '\\r\\', '\\', "'", '"', '=']
terms_to_space = ['detected', 'by', 'positive', 'parainfluenza', 'accession']
nums_following = ['date', 'telephone', 'tel', 'phone', 'received', 'collected',  
                 'result', 'on', 'at', '@', 'approved', 'final', 'time', 'number']
strings_to_replace = {'non detected':'not detected', 'npot detected':'not detected', 
                      'nor detected':'not detected', 'mot detected':'not detected', 
                      'n0t detected':'not detected', 'nit detected':'not detected',
                      'covid 19 virus not interpretation detected':'covid 19 virus interpretation not detected',
                      'presumptive interpretation':'interpretation presumptive',
                      'preliminary interpretation':'interpretation preliminary',
                      'covid 19 not detected and covid 19 detected':'covid 19 detected and covid 19 not detected',
                      'virusnot':'virus not', 'prevuous':'previous',
                      'mutaion': 'mutation', 'muation':'mutation','dectection':'detection',
                      'sars coc': 'sars cov', 'cov @': 'cov 2', 'cov2': 'cov 2', '2voc': '2 voc',
                      'covid 19': 'sars cov 2',
                      'u k':'uk', '2uk': '2 uk', 'e4874k': 'e484k',
                      'b 1 1 7':'b117', 'b 1 351':'b1351',
                      'p 1':'p1','p 2':'p2',
                      'n501 y':'n501y', 'n5oy 1':'n501y', 'n5oy1':'n501y', '5o':'50', 'n51y':'n501y', 'n5g1y':'n501y',
                      '501y v2':'501yv2', '20i 501y v1':'20i501yv1','20b 501y v3':'20b501yv3',
                      'voc202012 01':'voc20201201', 'voc202012 02':'voc20201202', 'voc202101 02':'voc20210102',
                      'voc 202012 01':'voc20201201', 'voc 202012 02':'voc20201202', 'voc 202101 02':'voc20210102',
                      'varieant':'variant', 'vriant':'variant',
                      'wtih':'with', 'wih':'with', 'dtecteion':'detection', 'dtected':'detected',
                      'deteected':'detected', 'detetcetd':'detected'
            
                     }

date_id_patterns = [r'\d{2,4} \d{2} \d{2,4} ', r'\d{4} \d{2} ', r'\d{4}h ', 
                   r' \d{0,2}[a-z]{0,2}\d{5,}[a-z]{0,1}', r' [a-z]{0,2}\d{1,3}[a-z]{1,3}\d{4,}[a-z]{0,1}',
                   r' \d{2}[a-z]{1}\d{3}[a-z]{2}\d{4}', r' [a-z]{4,}\d{7,}']

def clean(value):
    cleaned = value.lower()

    #clean xml field, only keep text field surrounded with 'p1 text'
    pattern = r'(<p1:microorganism xmlns)(.+)(<p1:text>.+</p1:text>)(.+)(</p1:microorganism>)'
    while re.search(pattern, cleaned):
        cleaned = re.sub(pattern, r'\g<3>', cleaned)
    
    #surround terms with spaces (some terms found stuck together)
    for t in terms_to_space:
        cleaned = cleaned.replace(t, ' ' + t + ' ')
    
    #replace punctuation with space
    for punc in puncs:
        cleaned = cleaned.replace(punc, ' ')

    #remove consecutive spaces
    while '  ' in cleaned:
        cleaned = cleaned.replace('  ', ' ')
    
    cleaned = cleaned.strip()     
    
    #remove numbers after certain terms
    for term in nums_following:
        pattern = term + r' \d{1,4}'
        
        while re.search(pattern, cleaned):
            cleaned = re.sub(pattern, term, cleaned)

    #fix certain strings
    for k, v in strings_to_replace.items():
        cleaned = cleaned.replace(k, v)       
        
    #remove more dates and ids
    for pattern in date_id_patterns:
        while re.search(pattern, cleaned):
            cleaned = re.sub(pattern, '', cleaned)
    
    #remove numbers at the end, but exclude variant terms
    while len(cleaned) > 0 and (cleaned[-1].isdigit() or cleaned[-1] == ' ') and \
        not (cleaned[-4:] == 'b117' or cleaned[-5:] == 'b1351' or cleaned[-6:] in ('501yv1','501yv2','501yv3')
                or cleaned[-11:] in ('voc20201201','voc20201202','voc20210102') ) :
        cleaned = cleaned[:-1]
    
    #remove "no" at the end
    while cleaned.endswith(' no') or cleaned == 'no':
        cleaned = cleaned[:-3]
    
    return cleaned
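
The order of the cleaning steps matters: punctuation is replaced with spaces before the typo/alias replacements run, which is why keys such as `'b 1 1 7'` are written with spaces. The reduced sketch below illustrates this with a small subset of the punctuation list and two entries taken from `strings_to_replace`; `clean_sketch`, `PUNCS`, and `FIXES` are hypothetical names, and the real `clean` function does considerably more.

```python
import re

PUNCS = [":", ",", ".", "-", "(", ")"]   # small subset, for illustration only
FIXES = {"npot detected": "not detected", "b 1 1 7": "b117"}

def clean_sketch(value):
    cleaned = value.lower()
    # 1) punctuation becomes spaces, so 'b.1.1.7' turns into 'b 1 1 7'
    for punc in PUNCS:
        cleaned = cleaned.replace(punc, " ")
    # 2) collapse runs of spaces and trim the ends
    cleaned = re.sub(r" {2,}", " ", cleaned).strip()
    # 3) only now can the space-separated keys match
    for wrong, right in FIXES.items():
        cleaned = cleaned.replace(wrong, right)
    return cleaned

print(clean_sketch("VOC (B.1.1.7): NPOT  Detected."))  # voc b117 not detected
```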

In [None]:
#tokenize values using nltk
def tokenize(value):
    tokenized = nltk.word_tokenize(value)
   
    return tokenized

In [None]:
#assign labels for useful tokens based on some dictionaries and exclusions
voc_dict = {'v_sgene_n501y': ['n501y'],
            'v_sgene_e484k': ['e484k'],
            'v_sgene_k417n': ['k417n'],
            'v_sgene_k417t': ['k417t'],
            'v_voc_b117': ['uk', 'b117', 'voc20201201', '20i501yv1'],
            'v_voc_b1351': ['south', 'africa', 'african','b1351','501yv2','voc20201202'],
            'v_voc_p1': ['p1','brazil', 'brazilian','voc20210102','20b501yv3'],
            'v_voc_p2': [],
             }

indirect_matches_dict = {'r_pos': ['posi','pos1','covpos'], 
                         'r_neg': ['neg', 'naeg', 'neag'],  
                         'r_ind': ['indeter', 'eterminate', 'inconclu', 'inderter', 'unable',
                                   'equivocal', 'unresolved'],
                         'r_can': ['cancel', 'incorrect', 'duplicate', 'mislabel', 'recollect','mistaken','redirected'],
                         'r_rej': ['reject', 'inval', 'leak', 'insuffic', 
                                   'spill', 'inapprop', 'nsq', 'poor', 'uninterpret'],
                         'presumptive': ['presump', 'prelim', 'possi']}
direct_matches_dict = {'r_pos': ['detected', 'pos', 'deteced', 'postive', 'organism','isolated', 'evidence'],
                       'r_neg': ['no', 'not'],
                       'r_ind': ['ind'],
                       'r_pen': ['pending', 'progress', 'follow', 'ordered', 'reordered'], #'sent', 'send', 'forward' 
                       'presumptive': ['possible', 'probable'],
#                        'xml': ['p1'], 
                       'stop': ['specific', 'required', 'error', 'copy', 'see', 'laboratory',
                                'note', 'notes', 'stability', 'changed', 'recollect', 'moh', 'if'],
                       'final': ['interpretation', 'interpetation', 'interp', 'pretation', 'interpretive',
                                 'final', 'overall', 'corrected', 'proved', 'correct','current'],
                       'connecting': ['presence', 'as',
                                      'is', 'of', 'in', '1', '2', '3', '4', 'a', 'b', 'c',
                                      '229e', 'nl63', 'hku1', 'oc43', '2019', 'low',
                                      'biosafety', 'hazard', 'has', 'been', 'for', 'changed', 'identified', 
                                      'result', 'other', 'using', 'to', 'from', 'tested',
                                      'phl', 'phol', 'phlo', 'new', 'request', 'lab', 'will',
                                      'panel', 'seasonal', 'human', 'report', 'said', 'updated',
                                      'associated', 'with', 'associate', 'vocs',
                                      'voc','variant','concern', 'mutation','detection','characterization',
                                      'complete', 'rt','time','pcr'] #'19','sars','cov', 'lineage?','emerged','analysis'
                                      } 
test_type_dict = {
                  't_scr':['snp', 'screening','screen','screened','screens'], #rt,pcr, 'seegene', 'allplex',
                  't_seq':['genomic', 'analysis', 'sequencing','sequenced','sequence']} #genome

def assign_labels(tokenized):
    tokenized_length = len(tokenized)
    useful = [None]*tokenized_length #store same list length of tokens and update each accordingly
    
    for counter, token in enumerate(tokenized):
        #skip if already assigned
        if useful[counter]:
            continue
        
        ###easy viruses dictionary (non-exact matching)
        ## mutation/voc dict
        for virus, patterns in voc_dict.items():
            if any([pattern in token for pattern in patterns]):
                useful[counter] = virus
                break
        
        # VOC/variant of concern at beginning or preceded by interpretation, treat as virus term
        if token == 'variant' and (tokenized_length > counter+3)\
            and tokenized[counter+1:counter+3] == ['of','concern']\
            and (counter == 0 or tokenized[counter-1]=='interpretation'):
            useful[counter:counter+3] = ['v_voc_general', 'connecting', 'connecting']  
                  
        # elif useful[counter].startwith('v_sgene')  
        elif token in ('n501y','e484k','k417n','k417t') and (tokenized_length > counter+3)\
        and tokenized[counter+1:counter+3] in (['s','gene'],
                                               ['spike','gene'],
                                               ['spike','s']):
            useful[counter+1:counter+3] = ['connecting']*2  
        elif token in ('n501y','e484k','k417n','k417t') and (tokenized_length > counter+3)\
        and tokenized[counter+1:counter+3] in (['uk','variant'],
                                               ['b117','variant'],
                                               ['brazil', 'variant'],
                                               ['brazilian', 'variant'],
                                               ['b1351', 'variant'],
                                               ['south', 'africa'],
                                               ['south', 'african']
                                               ):
            useful[counter+1:counter+3] = ['connecting']*2
        
        # sgene
        elif token in ('s', 'spike') and (tokenized_length > counter+1)\
        and tokenized[counter+1] == 'gene':
            useful[counter:counter+2] = ['v_sgene_mutation', 'connecting']
        elif token == 'sars' and (tokenized_length > counter+5)\
        and tokenized[counter+1:counter+5] in (['cov', '2', 'gene', 'mutation'],):
            useful[counter:counter+5] = ['v_sgene_mutation','connecting', 'connecting','connecting','connecting']
        
        #SARS COV 2
        elif tokenized[counter:counter+7] == ['sars','cov','2','n501y','single','nucleotide','polymorphism']:
            useful[counter:counter+7] = ['connecting']*7
        elif tokenized[counter:counter+4] == ['sars','cov','2','virus']:
            useful[counter:counter+4] = ['v_covid']*4
        elif tokenized[counter:counter+3] == ['sars','cov','2']:
            useful[counter:counter+3] = ['v_covid']*3
            
        # SARS COV 2 VARIANT(S)/VOC/VARIANT(S) OF CONCERN #tokenized[counter+1:counter+3] != ['s','gene']
        elif token in ('variant','voc','variants','vocs') and tokenized[counter-3:counter] == ['sars','cov','2']\
        and tokenized_length > 3 and tokenized[counter+1:counter+2] not in (['s'],['testing']) \
        and tokenized[counter-5:counter-3] != ['associated','with']:
            useful[counter-3:counter]=['v_voc_general','connecting','connecting']
        
        # p1 voc emerged in the uk 
        elif token == 'p1' and (tokenized_length > counter+6)\
        and tokenized[counter+1:counter+6] in (['voc','emerged', 'in', 'the', 'uk'],):
            useful[counter:counter+6] = ['v_voc_p1','connecting', 'connecting','connecting','connecting','connecting']

    # loop over the record again
    for counter, token in enumerate(tokenized):
        #skip if already assigned
        if useful[counter]:
            continue
        
        #condition for mention of pos/neg
        elif token in ('negative','neg','positive','pos','detected','organism')\
        and (tokenized_length > counter+1)\
        and ((tokenized[counter-1] in ('a','original','or','level','of','the','tested','was','false')
              and tokenized[counter+1] in ('test','result','covid','new','at'))
             or tokenized[counter+1] in ('or','swab','to','contact','workers','retest','results',
                                         'son','person','patients','travel','individual','undergo')):
            useful[counter-1:counter+2] = [None]*3
        elif token in ('negative','neg','positive','pos','detected','organism','posivtive')\
        and (tokenized_length > 1)\
        and (tokenized[counter-2] in ('previous','previously','contact','worker','depot','targets',
                                      'being','unless','patient','law','due','exposure','needs','if',
                                      'swab','who')
             or tokenized[counter-1] in ('previous','previously','known','unit','first','second',
                                         'needs','need','requires','considered','swab','if',
                                         'depot','employee','gram','cx','member','coworker','shows',
                                         'father','contact','both','and')):
            useful[counter-1:counter+1] = [None]*2
        elif token in ('negative','neg','positive','pos','detected','organism')\
        and (tokenized_length > 2)\
        and (tokenized[counter-3] in ('mom','him','father')):
            useful[counter-2:counter+1] = [None]*3 
            
        #condition for word before no
        elif token == 'no' and (tokenized[counter-1] in ('as','by','lab','specimen','accession',
                                                         'sample','order','please','phl')
                                or any([pattern in tokenized[counter-1] for pattern in ('out','break','inv')]))\
        and tokenized[counter+1:counter+2] != ['virus'] and counter > 0:
            useful[counter-1:counter+1] = [None]*2
        
        #condition for word after no (cancel)
        elif token == 'no' and (tokenized_length > counter+1)\
        and tokenized[counter+1] in ('specimen','reportable','done','gene','result',
                                     'media','liquid','sample','swab','nasopharyngeal','record','fluid',
                                     'patient','second','results','testing','eluate','option','chose',
                                     'speicmen','label','validated','culture',):
            useful[counter] = 'r_can'

        #condition for due to
        elif tokenized[counter:counter+2] == ['due','to'] and 'new' not in tokenized[counter+2:counter+4]:
            useful[counter:counter+2] = ['stop']*2
                
        #condition for word after not (cancel)
        elif token == 'not' and (tokenized_length > counter+1) and \
        tokenized[counter+1] in ('tested','tessted','perform','performed','process','processed', 
                                 'transmit','suitable','done','doen','reported','received',
                                 'match','needed','labelled','available','symptomatic','forwared',
                                 'met','specified','indicated','returned','sufficient',
                                 'valid','required','able','needed','contain','ordered','recieved',
                                 'labeled','a','provided','appropriate','sent','send','remove',
                                 'report','rapid','found','applicable','rec','used','order',
                                 'matched','labled','proccessed','accepted','receivd','completed',
                                 'recollect','preformed','appearing','in','collected','obtained'):
            useful[counter:counter+2] = ['r_can']*2
        
        #condition for word before not
        elif token == 'not' and tokenized[counter-1] in ('does','did','please','done','over','swab','but','do'):
            useful[counter-1:counter+1] = ['reset']*2
        
        #condition for errors
        elif tokenized[counter:counter+3] in (['ordered', 'in', 'error'], ['no', 'covid', 'result']):
            useful[counter:counter+3] = ['r_can']*3

        # condition for note
        elif tokenized[counter:counter+2] in (['note','pho'],['note','specimens']): 
            useful[counter:counter+2] = ['end']*2        
        
        #condition for previous
        elif 'previous' in token and ('reported' in tokenized[counter+1:counter+3] or
                                      'specimen' in tokenized[counter+1:counter+2] or
                                      (tokenized[counter+1:counter+3] == ['report','covid'] and
                                           tokenized[counter-1] == 'the') or
                                      tokenized[counter+1:counter+3] in (['report','of'],
                                                                         ['reports','of'],
                                                                         ['reportof','covid'],
                                                                         ['result','of'],
                                                                         ['covid','19'],
                                                                         ['entered','covid'],
                                                                         ['report','as'],
                                                                         ['result','was'],
                                                                         ['report','that'])):
            useful[counter:counter+2] = ['end']*2
                
        #unable and no evidence - indeterminate
        elif tokenized[counter:counter+3] in (['unable','to','complete'],
                                              ['not','be','performed']):
            useful[counter:counter+3] = ['r_ind','connecting','connecting']
        elif tokenized[counter:counter+4] in (['unable','to','generate','sars'],):
            useful[counter:counter+4] = ['r_ind','connecting','t_seq','v_voc_general']    
    
        #SNP screen
        elif tokenized[counter:counter+3] in (['single','nucleotide','polymorphism'],
                                              ['mutation','associated','with']):
            useful[counter:counter+3] = ['t_scr']*3
        
        #sequencing instead of screening
        elif tokenized[counter:counter+7] == ['variant','screening','was','performed','by','sanger','sequencing'] \
                    and (tokenized_length > counter+1):
            useful[counter+1] = 'connecting'
          
        else:
            #indirect_matches dictionary
            for term, patterns in indirect_matches_dict.items():
                if any([pattern in token for pattern in patterns]):
                    useful[counter] = term
                    break
                    
            #direct_matches dictionary
            for term, patterns in direct_matches_dict.items():
                if any([pattern == token for pattern in patterns]):
                    useful[counter] = term
                    break
                    
            #test_type dictionary
            for test, patterns in test_type_dict.items():
                if any([pattern == token for pattern in patterns]):
                    useful[counter] = test
                    break
        
    return useful
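
The fallback branch of `assign_labels` applies two kinds of dictionary lookup: substring matching for `indirect_matches_dict` and exact matching for `direct_matches_dict`, with a later match overriding an earlier one. The sketch below reproduces that behaviour on a cut-down pair of dictionaries; `INDIRECT`, `DIRECT`, and `label_token` are hypothetical names used only for this illustration.

```python
INDIRECT = {"r_neg": ["neg"], "r_ind": ["indeter"]}               # substring patterns
DIRECT = {"r_pos": ["detected", "pos"], "r_neg": ["no", "not"]}   # exact tokens

def label_token(token):
    label = None
    # Substring match: 'neg' also labels 'negative'.
    for name, patterns in INDIRECT.items():
        if any(p in token for p in patterns):
            label = name
            break
    # Exact match: 'not' labels only the token 'not', never 'nothing'.
    for name, patterns in DIRECT.items():
        if token in patterns:
            label = name   # a direct match overrides an indirect one
            break
    return label

for tok in ("negative", "detected", "nothing", "indeterminate"):
    print(tok, label_token(tok))
```

Exact matching is what keeps short patterns such as `'no'` and `'not'` from firing inside unrelated longer words.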

### Algorithm description

Using the useful_tokens field, this interpret function sequentially "reads" the terms. It picks up virus/result/test terms and they are held in a "bundle" (virus, result, test). There are also multiple modifiers that affect the way that the algorithm processes the terms. These modifiers are: final (flag to take highest priority later on), presumptive (change pos to pre), end (end reading early or skip the next save), and skip (skip the 'save when virus switches' rule once). Any time a bundle is saved, the bundle (except for test type) and the final/presumptive modifiers are cleared. If a save occurs with incomplete information, the virus defaults to an unknown virus, result defaults to negative, and test defaults to unknown test. Whenever a save happens, all of the previous tokens+labels that were read are considered to be a "segment".
<br>
- First, the xml field is processed if there is one. If a relevant virus is found, it is treated as a positive and the bundle is saved.
- Next, the algorithm will go through the labelled tokens one by one. There are different conditions for storing terms and saving the bundle when encountering a virus, a result, a special term, or an irrelevant (unlabelled) term.
    - Viruses: A relevant virus is always kept. If the virus switches, save the bundle (note: can be affected by skip modifier). If the same virus is read, save the bundle only if there is a result as well. An unknown virus is only kept if there is no current virus.
    - Results: A clear result (ind, neg, pos) is kept with hierarchy ind > neg > pos such that a neg/ind can overwrite a positive if it's close together (e.g., "not detected" becomes a neg). An unclear result (rej, can, pen) is only kept if there is no current result with hierarchy rej > can > pen. If there is already a previous result and a neg/ind is encountered, save the bundle.
    - Special terms:
        - Final: Modifier to add flag when saving to specify whether it is a final result, which takes higher priority over all others in the process_result function. Save if there is a current virus and current result. Clear the current result.
        - Presumptive: Modifier to change positive (r_pos) into presumptive-positive (r_pre).
        - End: Modifier to skip the next save. Save if there is a current virus and current result. Stop the reading if there are any results.
        - Reset: Clear bundle without saving.
        - Stop: Save if there is a current virus and current result. Clear the bundle.        
    - Irrelevant terms: If two irrelevant terms (Nones) are read in a row, save the bundle if there is both a current result and virus. Also save the bundle if there is a virus and the past segment had another virus (virus_counter > 1; normally viruses tested are listed in a mpx or pcr assay). Otherwise, clear the bundle without saving and reset all the counter variables (i.e., start a new segment).
    - If the sentence ends before hitting two Nones, save any result and save the bundle if there is a virus and the past segment had another virus.
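
To make the bundle mechanics concrete, here is a heavily reduced sketch of the reading loop. It covers only viruses, the clear results (ind, neg, pos), and the presumptive modifier; it deliberately omits test types, the final/end/stop/reset terms, `v_unk` handling, and the counting of consecutive Nones. `interpret_sketch` is a hypothetical name and should not be taken as the behaviour of the full `interpret` function.

```python
def interpret_sketch(labels):
    output = []
    virus, result, presumptive = None, None, False

    def save():
        nonlocal virus, result, presumptive
        saved = result
        if saved == "r_pos" and presumptive:
            saved = "r_pre"                      # presumptive turns pos into pre
        output.append((virus or "v_unk", saved)) # missing virus defaults to unknown
        virus, result, presumptive = None, None, False

    for label in labels:
        if label is None:
            continue                             # irrelevant tokens are ignored here
        if label.startswith("v_"):
            if virus and result:
                save()                           # a new virus mention closes the bundle
            virus = label
        elif label == "r_ind":
            if result:
                save()
            result = label
        elif label == "r_neg" and result != "r_ind":
            if result:
                save()
            result = label
        elif label == "r_pos" and result not in ("r_ind", "r_neg"):
            result = label                       # pos never overrides neg/ind
        elif label == "presumptive":
            presumptive = True

    if result:
        save()                                   # end of sentence: save any result
    return output

# "b117 not detected": 'not' is r_neg, so the following r_pos is ignored
print(interpret_sketch(["v_voc_b117", "r_neg", "r_pos"]))
```

The example prints `[('v_voc_b117', 'r_neg')]`, showing how "not detected" resolves to a negative because the earlier neg blocks the later pos.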

In [None]:
#interpret text to get initial results
def interpret(useful):
    
    def save(b):
        #presumptive modifier
        if b[1] == 'r_pos' and modifier[1]:
            b[1] = 'r_pre'

        #end modifier (skips a save)
        if not modifier[2] or modifier[0]:
            output.append([b[0] if b[0] else 'v_unk', 
                           b[1] if b[1] else 'r_neg', 
                           b[2] if b[2] else 't_unk', 
                           modifier[0]]) #final modifier
        
        b[0] = None
        b[1] = None
        modifier[0:4] = [False, False, False, False]
        return
    
    sentence = useful[:]
    output = []
    
    #bundle for current virus/result/test
    #0 = virus, 1 = result, 2 = test
    bundle = [None, None, None]
    
    #modifiers
    #0 = final, 1 = presumptive, 2 = end, 3 = skip
    modifier = [False, False, False, False]
    
    none_counter = 0 #counter for hitting consecutive irrelevant words
    virus_counter = 0 #counter for different viruses in same segment
    
    #xml field processing
    xml_pos = [i for i, x in enumerate(sentence) if x == 'xml']
    num = len(xml_pos)//2
    for i in range(num):
        xml_start_pos = xml_pos[i*2]
        xml_end_pos = xml_pos[i*2+1]
        for j in range(xml_start_pos, xml_end_pos + 1):
            if sentence[j] and sentence[j].startswith('v_') and sentence[j] != 'v_unk':
                bundle[0] = sentence[j]
                bundle[1] = 'r_pos'
                save(bundle)

    #loop on words in sentence
    for word in sentence:
        
        if word: #relevant term
            none_counter = 0 #restart counter
            
            #set current virus             
            if word.startswith('v_'):
                #different virus
                if word != 'v_unk' and word != bundle[0]:
                    #save current result if hitting a different virus
                    if bundle[0] and bundle[0] != 'v_unk' and bundle[1]:
                        save(bundle)
                    bundle[0] = word
                #same virus
                elif word != 'v_unk' and word == bundle[0]:
                    #save current result if there is one
                    if bundle[1]:
                        save(bundle)
                    bundle[0] = word
                #only set to general virus if there's no current virus
                elif word == 'v_unk' and not bundle[0]:
                    bundle[0] = word
            
            #set current result
            elif word.startswith('r_'):
                if word == 'r_ind':
                    if bundle[1]: 
                        save(bundle)
                    bundle[1] = word
                elif word == 'r_neg' and bundle[1] not in ('r_ind',):
                    if bundle[1]: 
                        save(bundle)
                    bundle[1] = word
                elif word == 'r_pos' and bundle[1] not in ('r_ind', 'r_neg'):
                    bundle[1] = word

                elif word in ('r_rej', 'r_can', 'r_pen') and bundle[1] not in ('r_ind', 'r_neg', 'r_pos'):
                    if word == 'r_rej':
                        bundle[1] = word
                    elif word == 'r_can' and bundle[1] not in ('r_rej',):
                        bundle[1] = word
                    elif word == 'r_pen' and bundle[1] not in ('r_rej', 'r_can'):
                        bundle[1] = word
                
            #set current test
            elif word.startswith('t_'):
                bundle[2] = word
            
            #final modifier
            elif word == 'final':
                if bundle[0] and bundle[1]:
                    save(bundle)
                modifier[0] = True
                bundle[1] = None #reset result
            
            #presumptive modifier
            elif word == 'presumptive':
                modifier[1] = True
            
            #end modifier/word
            elif word == 'end':
                #save the current result, if any, before ending
                if bundle[0] and bundle[1]:
                    save(bundle)
                modifier[0:4] = [False, False, True, False] #end modifier skips next save
                #end early only if there is already result
                if len(output) > 0:
                    return output        
            
            #stop word
            elif word == 'stop':
                if bundle[0] and bundle[1]:
                    save(bundle)
                modifier[0:4] = [False, False, False, False]
                bundle[0] = None
                bundle[1] = None           
                
            #reset word
            elif word == 'reset':
                modifier[0:4] = [False, False, False, False]
                bundle[0] = None
                bundle[1] = None
            
        else: #word is None
            none_counter += 1
            
            if none_counter == 2: #can change threshold
                #save if there is current virus and result
                if bundle[0] and bundle[1]:
                    save(bundle)
                #reset
                none_counter = 0 
                virus_counter = 0
                bundle[0] = None
                bundle[1] = None
                modifier[0:4] = [False, False, False, False]
                
    #if there is still a remaining result
    if bundle[1]: 
        save(bundle)
            
    return output
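
The `r_*` branches in the loop above imply an order of precedence among result labels within one segment. The sketch below is one reading of that order in isolation. `RESULT_PRECEDENCE` and `pick_result` are hypothetical names that do not appear in the script, and the sketch leaves out the `save(bundle)` calls that the real function makes before overwriting a result.

```python
# Precedence implied by the r_* branches (assumed reading): earlier entries win.
RESULT_PRECEDENCE = ['r_ind', 'r_neg', 'r_pos', 'r_rej', 'r_can', 'r_pen']

def pick_result(current, new):
    """Return the label the bundle would hold after seeing `new` (hypothetical helper)."""
    if current is None:
        return new
    # a new label replaces the current one only if it ranks at least as high
    if RESULT_PRECEDENCE.index(new) <= RESULT_PRECEDENCE.index(current):
        return new
    return current

print(pick_result('r_pos', 'r_neg'))  # negative overrides positive
print(pick_result('r_ind', 'r_pos'))  # indeterminate is never overridden
```

Keeping the order in a single list may make it easier to review than a chain of `not in (...)` checks, although it is not how the script is written.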

In [None]:
#using reference excel to assign LOINCs to virus and test type
#added COVID19 LOINCs
xlsx_filename = 'COVID19_VOC_codes_20210315.xls'
mappings = {'--':'unk'}

df_loincs = pd.read_excel(xlsx_filename, sheet_name='VOC_LOINCs')

#cleaning the categories to match previously defined ones
df_loincs = df_loincs.replace(mappings)
df_loincs['Virus_to_assign'] = df_loincs['Virus_to_assign'].apply(lambda x: 'v_' + x)
df_loincs['Test_to_assign'] = df_loincs['Test_to_assign'].apply(lambda x: 't_' + x)

#assign LOINCs to virus and test type
loincs_by_v = {}
loincs_by_t = {}
for index, row in df_loincs.iterrows():
    loincs_by_v.setdefault(row['Virus_to_assign'], [])
    loincs_by_v[row['Virus_to_assign']].append(row['LOINCs'])
    loincs_by_t.setdefault(row['Test_to_assign'], [])
    loincs_by_t[row['Test_to_assign']].append(row['LOINCs'])

#remove the unk ones
# del loincs_by_v['v_unk']
del loincs_by_t['t_unk']

#use reference excel to assign TR codes to virus and test type
df_tr_codes = pd.read_excel(xlsx_filename, sheet_name='VOC_TRs')

#cleaning the categories to match previously defined ones
df_tr_codes = df_tr_codes.replace(mappings)
df_tr_codes['Virus_to_assign'] = df_tr_codes['Virus_to_assign'].apply(lambda x: 'v_' + x)
df_tr_codes['Test_to_assign'] = df_tr_codes['Test_to_assign'].apply(lambda x: 't_' + x)

#assign TR codes to virus and test type
tr_codes_by_v = {}
tr_codes_by_t = {}
for index, row in df_tr_codes.iterrows():
    tr_codes_by_v.setdefault(row['Virus_to_assign'], [])
    tr_codes_by_v[row['Virus_to_assign']].append(row['TRs'])
    tr_codes_by_t.setdefault(row['Test_to_assign'], [])
    tr_codes_by_t[row['Test_to_assign']].append(row['TRs'])
    
#remove the unk ones
del tr_codes_by_v['v_unk']
# del tr_codes_by_t['t_unk']
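
The cell above builds label-to-codes lookups row by row. A plain-Python sketch of the same grouping is shown below, assuming the sheet has been read into a list of dicts. The two rows are illustrative only and do not reproduce the reference spreadsheet, and `group_codes` is a hypothetical helper.

```python
from collections import defaultdict

# Illustrative rows only; the real sheet contents are not reproduced here.
rows = [
    {'TRs': 'TR12952-8', 'Virus_to_assign': '--', 'Test_to_assign': 'scr'},
    {'TRs': 'TR12953-6', 'Virus_to_assign': '--', 'Test_to_assign': 'seq'},
]
mappings = {'--': 'unk'}

def group_codes(rows, code_col, label_col, prefix):
    """Group codes under their prefixed label (hypothetical helper)."""
    grouped = defaultdict(list)
    for row in rows:
        # translate the '--' placeholder to 'unk', as the replace() call does
        label = mappings.get(row[label_col], row[label_col])
        grouped[prefix + label].append(row[code_col])
    return dict(grouped)

codes_by_v = group_codes(rows, 'TRs', 'Virus_to_assign', 'v_')
codes_by_t = group_codes(rows, 'TRs', 'Test_to_assign', 't_')

# pop with a default does not raise when the label is absent, unlike del
codes_by_v.pop('v_unk', None)
print(codes_by_t)
```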

In [None]:
# assign more details to v_unk or t_unk based on LOINC and TR code
# group by test type and then type of virus, remove duplicates
loinc_exclusions = ['10219-4','10182-4','11329-0','14869-2','21026-0','22633-2','22634-0','22635-7','22636-5','22637-3',
                    '22638-1','22639-9','31208-2','3150-0','33882-2','35265-8','41000-1','47526-9','49049-0','55752-0','56816-2','59465-5',
                    '59466-3','664-3','66746-9','76425-8','XON10007-3','XON10011-5','XON10313-5','XON10315-0','XON10316-8',
                    'XON10337-4','XON11913-1','XON12721-7','XON12875-1','XON13543-4','XON13544-2','XON13545-9',
                    '94558-4','94661-6']
tr_exclusions = ['TR12942-9']

def process_result(tokens, testrequestcode, observationcode, results, reportinglaborgname):
    dd = {}
    
    #LOINC/TR exclusions
    if (observationcode in loinc_exclusions) or (testrequestcode in tr_exclusions):
        return dd 
    
    ###extra conditions
    
    #ignore S gene mutation XON13583-0 detected 
#     if reportinglaborgname == 'Mount Sinai Hospital' and observationcode == 'XON13583-0' and tokens == ['detected']:
#         return dd
    
    for i in range(len(tokens)):
        #change negative to pending if there are results to follow
        if tokens[i:i+3] == ['to', 'follow', 'tested']:
            for r in results:
                if r[1] in ('r_neg','r_can','r_rej') and not r[3]:
                    r[1] = 'r_pen'      
              
    ###determine virus or test based on LOINC or TR
    v_from_loinc = [loinc_vir for loinc_vir, loincs in loincs_by_v.items() if observationcode in loincs]
    v_from_tr = [tr_codes_vir for tr_codes_vir, tr_codes in tr_codes_by_v.items() if testrequestcode in tr_codes]
    t_from_loinc = [loinc_test for loinc_test, loincs in loincs_by_t.items() if observationcode in loincs]
    t_from_tr = [tr_codes_test for tr_codes_test, tr_codes in tr_codes_by_t.items() if testrequestcode in tr_codes]
    
    #determine if there are any final/interpretation results
    viruses_with_final = [(v,t) for (v,r,t,f) in results if r in ('r_pos', 'r_pre', 'r_ind', 'r_neg', 'r_rej') and f]
    results_final = results
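    #remove the non-final/interpretation results for viruses with final/interpretation
    #filter results_final cumulatively so that every (virus, test) pair with a final result is applied
    for vf,tf in viruses_with_final:
        results_final = [(v,r,t,f) for (v,r,t,f) in results_final if not (v == vf and t == tf and not f)]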
        
    for v, r, t, f in results_final:
        #fill in unknown virus
        if v == 'v_unk':
            if len(v_from_loinc) > 0:
                v = v_from_loinc[0]
            elif len(v_from_tr) > 0:
                v = v_from_tr[0]
        
        ## if any variant term present excluding voc_gen, assign t_seq
        if v.startswith('v_voc') and v != 'v_voc_general':
            t = 't_seq'
        
        #If test is unknown and any voc virus present, assign t_seq
        if t == 't_unk' and any([v.startswith('v_voc') and r in ('r_pos','r_pre','r_neg') for v,r,t,f in results]):
            t = 't_seq'
        
        #fill in unknown test
        if t == 't_unk':
            if len(t_from_loinc) > 0:
                t = t_from_loinc[0]
            elif len(t_from_tr) > 0:
                t = t_from_tr[0]

        if t == 't_unk' and (any('screen' in tok for tok in tokens) or v == 'v_sgene_n501y'): 
            t = 't_scr'
        elif t == 't_unk' and any('sequenc' in tok for tok in tokens): 
            t = 't_seq'
        elif t == 't_unk' and any(rv.startswith('v_voc') for rv, _, _, _ in results):
            t = 't_seq'
   
        # replace all v_sgene_mutation for t_scr with v_voc_general
        if v == 'v_sgene_mutation' and t == 't_scr':
            v ='v_voc_general'
            
        #additional logic for v == 'v_sgene_mutation' and t == 't_seq'??

        #remove unknown virus results
        if v != 'v_unk' and t != 't_unk':
            v, r, t = v[2:], r[2:], t[2:]
            dd.setdefault(t, [])
            
            #compiling results with hierarchy: S (presumptive positive) > P (positive) > N (negative)
            #                                  >  I (indeterminate) > D (pending) > R (invalid) > C (cancelled) 
            same_vir = False
            for i in range(len(dd[t])):
                if v == dd[t][i][0]:
                    same_vir = True
                    if r == 'pre':
                        dd[t][i] = (v,r)
                    elif r == 'pos' and dd[t][i][1] not in ('pre',):
                        dd[t][i] = (v,r)
                    elif r == 'neg' and dd[t][i][1] not in ('pre', 'pos'):
                        dd[t][i] = (v,r)
                    elif r == 'ind' and dd[t][i][1] not in ('pre', 'pos', 'neg'):
                        dd[t][i] = (v,r)
                    elif r == 'pen' and dd[t][i][1] not in ('pre', 'pos', 'neg', 'ind'):
                        dd[t][i] = (v,r)
                    elif r == 'rej' and dd[t][i][1] not in ('pre', 'pos', 'neg', 'ind', 'pen'):
                        dd[t][i] = (v,r)
                    elif r == 'can':
                        pass
            if not same_vir:
                dd[t].append((v,r))
        
    return dd
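
The `if`/`elif` chain that compiles results per virus can also be read as a rank comparison. The following is a minimal sketch of that reading, using the hierarchy stated in the comment inside `process_result`. `RANK` and `merge_results` are hypothetical names, not part of the script.

```python
# Rank per the hierarchy comment in process_result: lower number = higher priority.
RANK = {'pre': 0, 'pos': 1, 'neg': 2, 'ind': 3, 'pen': 4, 'rej': 5, 'can': 6}

def merge_results(pairs):
    """Collapse (virus, result) pairs to one result per virus (hypothetical helper)."""
    best = {}
    for virus, result in pairs:
        # keep the first result seen, then replace it only with a strictly better one
        if virus not in best or RANK[result] < RANK[best[virus]]:
            best[virus] = result
    return best

pairs = [('voc_b117', 'neg'), ('voc_b117', 'pos'),
         ('sgene_n501y', 'can'), ('sgene_n501y', 'rej')]
print(merge_results(pairs))
```

Because the comparison is strict, a cancelled result never replaces an existing one, which matches the `pass` branch for `'can'`.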

In [None]:
#create output as character value for each virus and test type
result_char = {'pre':'S', 'pos': 'P', 'neg':'N', 'ind':'I', 'pen':'D', 'can':'C', 'rej':'R'}

def char_output(results, ind):

    #loop through each test type and virus
    for t, pairs in results.items(): 
            for v, r in pairs:
                if v.startswith('voc'):
                    if '_general' in v:
                        df_results.at[ind, t+'_voc'] = result_char[r]
                    elif '_b117' in v:
                        df_results.at[ind, t+'_voc_b117'] = result_char[r]
                    elif '_b1351' in v:
                        df_results.at[ind, t+'_voc_b1351'] = result_char[r]
                    elif '_p1' in v:
                        df_results.at[ind, t+'_voc_p1'] = result_char[r]
                    elif '_p2' in v:
                        df_results.at[ind, t+'_voc_p2'] = result_char[r]
                    
                elif v.startswith('sgene'):
                    if '_n501y' in v:
                        df_results.at[ind, t+'_sgene_n501y'] = result_char[r]
                    elif '_e484k' in v:
                        df_results.at[ind, t+'_sgene_e484k'] = result_char[r]
                    elif '_k417n' in v:
                        df_results.at[ind, t+'_sgene_k417n'] = result_char[r]
                    elif '_k417t' in v:
                        df_results.at[ind, t+'_sgene_k417t'] = result_char[r]
                   
    return

### Initial processing

In [None]:
%%time
#make copy of df
df_unique = df_gp.copy(deep = True)

#clean text
df_unique["cleaned_value"] = df_unique["value"].apply(clean)

#group by unique records (org, TR code, Obs code, cleaned text) and store original indexes as tuple
df_unique = df_unique.reset_index()
groupby_vars = ['reportinglaborgname', 'testrequestcode', 'observationcode', 'cleaned_value']
df_unique = df_unique.groupby(groupby_vars).agg({'value': 'count', 
                                                 'original_indexes': lambda x: tuple([i for tup in x for i in tup])}).reset_index()
df_unique = df_unique.rename(columns={'value':'count'})

df_unique = df_unique.sort_values(by=['count'], ascending=False).reset_index(drop=True)
print('unique records after cleaning:', len(df_unique))

#tokenize
df_unique["cleaned_tokenized_value"] = df_unique["cleaned_value"].apply(tokenize)
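
Grouping identical records means each distinct text is interpreted once, and the stored original indexes allow the result to be copied back to every source row later. The idea can be sketched in plain Python as follows. The lab names, texts and the `interpret_once` stand-in are all made up for illustration.

```python
from collections import defaultdict

# (lab, TR code, cleaned text); hypothetical records
records = [
    ('Lab A', 'TR12952-8', 'voc detected'),
    ('Lab A', 'TR12952-8', 'voc detected'),
    ('Lab B', 'TR12953-6', 'voc not detected'),
]

# group identical records, remembering the row positions each group came from
groups = defaultdict(list)
for position, record in enumerate(records):
    groups[record].append(position)

def interpret_once(record):
    # stand-in for the real text pipeline (hypothetical)
    return 'N' if 'not detected' in record[2] else 'P'

# interpret each unique record once, then copy the answer to every original row
output = [None] * len(records)
for record, positions in groups.items():
    result = interpret_once(record)
    for position in positions:
        output[position] = result

print(len(groups), output)
```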

### Assign results

In [None]:
#assign labels using dictionary
df_unique["useful_tokens"] = df_unique["cleaned_tokenized_value"].apply(assign_labels)

#interpret the labelled tokens
df_unique["initial_results"] = df_unique["useful_tokens"].apply(interpret)

#fill in unknown viruses/tests based on LOINC or TR code, roll up to one result per virus per test type
final_results = []
for i in range(len(df_unique)):
    final_results.append(process_result(df_unique["cleaned_tokenized_value"][i],
                                        df_unique["testrequestcode"][i],df_unique["observationcode"][i], 
                                        df_unique["initial_results"][i],df_unique["reportinglaborgname"][i]))

In [None]:
#translate results to 1-character format
# change output cols
result_cols = ['scr_voc','scr_sgene_n501y','seq_sgene_n501y',
               'seq_sgene_e484k','seq_sgene_k417n','seq_sgene_k417t',
               'seq_voc', 'seq_voc_b117', 'seq_voc_b1351','seq_voc_p1', 'seq_voc_p2']

#create empty df to fill in results
df_results = pd.DataFrame(index=np.arange(len(df_unique)), columns=['original_indexes']+result_cols)
df_results['original_indexes'] = df_unique['original_indexes']

#fill in results
for i in range(len(df_unique)):
    char_output(final_results[i], i)

#fill in seq_voc/scr_voc if there are any specific vocs
# take the highest-priority result across the set of columns (the minimum after numeric mapping)

result_mappings = {'S':0,'P':1,'N':2,'I':3,'D':4,'C':5,'R':6, np.nan:100}
result_mappings_output = {0:'S',1:'P', 2:'N', 3:'I',4:'D',5:'C',6:'R',100:np.nan}

for c in result_cols:
    df_results[c] = df_results[c].map(result_mappings)
    
df_results['scr_voc'] = df_results.apply(lambda row: min([row[c] for c in result_cols if c.startswith('scr')]),axis=1)
df_results['seq_voc'] = df_results.apply(lambda row: min([row[c] for c in result_cols if c.startswith('seq_voc')]) ,axis=1)

for c in result_cols:
    df_results[c] = df_results[c].map(result_mappings_output)
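
Mapping letters to numbers turns "highest-priority result" into a simple minimum. Below is a small sketch of that step for a single row, using the same ranks as `result_mappings` with `None` standing in for a missing value. `overall` is a hypothetical helper.

```python
# same ranks as result_mappings; None stands in for np.nan here (assumption)
RANK = {'S': 0, 'P': 1, 'N': 2, 'I': 3, 'D': 4, 'C': 5, 'R': 6, None: 100}
LETTER = {rank: letter for letter, rank in RANK.items()}

def overall(values):
    """Return the highest-priority letter among one row's column values."""
    return LETTER[min(RANK[v] for v in values)]

print(overall(['N', None, 'P']))
print(overall([None, None]))
```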

## Final output

In [None]:
#order results based on original_indexes
output = [None]*len(df)

# drop any columns not in result_cols - should not happen
for c in df_results.columns:
    if c not in result_cols and c!='original_indexes':
        print(c)
        df_results.drop(columns=c,inplace=True)
        
for row in df_results.itertuples():
    for i in row[1]: #original_indexes
        output[i] = tuple(row[2:])
        
if output_flag == 1:                
    df_output = pd.concat([df, pd.DataFrame(output, columns=result_cols)], axis=1)
elif output_flag == 2: 
    df_output = df_raw.join(df[['exclude_flag']].join(pd.DataFrame(output, columns=result_cols)))
    
else:
    print('PLEASE ENTER ONE OF THE FOLLOWING OPTIONS FOR OUTPUT_FLAG IN THE FIRST CELL: 1, 2')

In [None]:
#FINAL DATASET TO OUTPUT
df_output.to_csv(output_filename+'.csv', index=False)

In [None]:
df_output['scr_voc'].value_counts()
# df_output.describe()

## Roll Up 



In [None]:
output_filename_episodes = 'episodes_voc'
proper_TRCs = False

#run only on proper TRCs
if proper_TRCs:
    output_filename_episodes += '_tr' #the episode file is the only output written after this point
    df_output = df_output[df_output.testrequestcode.isin(['TR12952-8','TR12953-6'])]

#remove observations based on observationresultstatus
print('Number of records removed with result status N/X: '+str(sum(df_output['observationresultstatus'].isin(('N','X')))))
df_clean = df_output[~df_output['observationresultstatus'].isin(('N','X'))]

print('Number of records removed with result status W (exclude_flag): '+str(sum(df_clean['exclude_flag'] == 'Y')))
df_clean = df_clean[df_clean['exclude_flag'] == 'N']

print('Number of records removed with blank patientid: '+str(sum(df_clean['patientid'] == '')))
df_clean = df_clean[df_clean['patientid'] != ''].copy() #copy so later assignments do not write to a slice of df_output

print('Number of records remaining: '+str(len(df_clean)))

In [None]:
#hierarchy: P > N > I > D > C > R 
#P = Positive,  N = Negative, I = Indeterminate, D = penDing, C = Cancelled, R = Rejected/invalid
result_mappings = {'P':1,'N':2,'I':3,'D':4,'C':5,'R':6, np.nan:100,'':100}
result_mappings_output = {1:'P', 2:'N', 3:'I',4:'D',5:'C',6:'R',100:''}
result_tf_output = {1:'T', 0:'F'}

#assign S (presumptive-positive) as P (positive)
#convert result variable (from previous script) to number for hierarchy
for c in result_cols:
    df_clean.loc[df_clean[c] == 'S', c] = 'P'
    df_clean[c] = df_clean[c].map(result_mappings)

blank_col_count = 100*len([c for c in result_cols if c.startswith('seq')])

# scr/seq flag
df_clean['scr_flag'] = df_clean.apply(lambda row: 1 if row['scr_voc'] < 100 else 0, axis=1)
df_clean['seq_flag'] = df_clean.apply(lambda row: 1 if sum([row[c] for c in result_cols if c.startswith('seq')]) < blank_col_count else 0, axis=1)


# keep rows with any result
df_clean = df_clean[(df_clean.scr_flag == 1) | (df_clean.seq_flag == 1)].copy() #copy so new columns are not added to a slice

# scr/seq completed: observations that have a clear covid result (P, N, I, D) 
df_clean['scr_test'] = df_clean['scr_voc'].apply(lambda row: 1 if row in (1,2,3,4) else 0)
df_clean['seq_test'] = df_clean.apply(lambda row: 1 if any([row[c] in (1,2,3,4) for c in result_cols if c.startswith('seq')]) else 0,axis=1)


#date versions of datetime
df_clean['specimenreceiveddate'] = df_clean['specimenreceiveddatetime'].apply(lambda x: np.datetime64(x, 'D'))
df_clean['observationdate'] = df_clean['observationdatetime'].apply(lambda x: np.datetime64(x, 'D'))
df_clean['observationreleasedate'] = df_clean['observationreleasets'].apply(lambda x: np.datetime64(x, 'D'))


#drop used columns
df_clean.drop(columns=['seq_flag','scr_flag','observationresultstatus','exclude_flag','observationcode','observationdatetime','specimenreceiveddatetime'], inplace=True)

In [None]:
### ROLL UP to EPISODE

# for each episode, flag whether any record has a completed scr/seq test (max of the 0/1 flags)
df_episodes = df_clean.groupby(['patientid','observationdate']).agg({'seq_test':'max','scr_test':'max'}).reset_index()

# for each result column, take highest priority result (clear result > latest release timestamp > result hierarchy)
# also collect ordersids where results were taken
for c in result_cols:
    t = c + '_test'
    
    df_temp = df_clean[['patientid','observationdate','observationreleasets',c,'ordersid']].copy()
    df_temp[t] = df_temp[c].apply(lambda x: 1 if x in (1,2,3,4) else 0)
    df_temp = df_temp.sort_values(['patientid','observationdate',t,'observationreleasets',c,'ordersid'], 
                           ascending=[True,True,False,False,True,True]).\
                groupby(['patientid','observationdate']).first().reset_index() 
    df_temp.rename(columns={'ordersid':'ordersid_'+c},inplace=True)
    df_temp.drop(columns=['observationreleasets',t],inplace=True)
    df_episodes = df_episodes.merge(df_temp,how='left',on=['patientid','observationdate'])
    
df_episodes['final_result_ordersids'] = df_episodes.apply(lambda x: ','.join(sorted(list(set([str(int(x[c])) for c in df_episodes.columns \
                                                                     if c.startswith('ordersid_')])))),axis=1)

df_episodes.drop(columns=[x for x in df_episodes.columns if x.startswith('ordersid_')], inplace=True)

#taking the max observationreleasedate/specimenreceiveddate for each test type
for c in ['observationreleasedate','specimenreceiveddate']:
    for t in ['scr','seq']:
        df_temp = df_clean[df_clean[t+'_test']==1].groupby(['patientid','observationdate']).\
                            agg({c:'max'}).\
                            rename(columns={c:c+'_'+t}).\
                            reset_index()
        df_episodes = df_episodes.merge(df_temp,how='left',on=['patientid','observationdate'])
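
The sort-then-first step picks one row per episode by three criteria in order: a clear result, then the latest release timestamp, then the result hierarchy, with `ordersid` as the final tie-break. A plain-Python sketch of the same ordering for one episode and one result column follows. The rows and the integer `release_ts` field are hypothetical.

```python
CLEAR = (1, 2, 3, 4)  # P, N, I, D after numeric mapping

def pick_episode_row(rows, col):
    """Pick the row whose result in `col` represents the episode (hypothetical helper)."""
    def sort_key(row):
        is_clear = 1 if row[col] in CLEAR else 0
        # negate the fields that should sort in descending order
        return (-is_clear, -row['release_ts'], row[col], row['ordersid'])
    return min(rows, key=sort_key)

rows = [
    {'release_ts': 3, 'seq_voc': 100, 'ordersid': 1},  # latest, but no result
    {'release_ts': 1, 'seq_voc': 2, 'ordersid': 2},    # clear, but older
    {'release_ts': 2, 'seq_voc': 4, 'ordersid': 3},    # clear, pending
    {'release_ts': 2, 'seq_voc': 1, 'ordersid': 4},    # clear, positive
]
print(pick_episode_row(rows, 'seq_voc')['ordersid'])
```

Unlike `groupby().first()`, this sketch does not skip null values within a column, so it is an approximation of the pandas behaviour rather than a replacement.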

In [None]:
# Map numeric results back to text
for c in result_cols:
    df_episodes[c] = df_episodes[c].map(result_mappings_output)
    
df_episodes['scr_test']=df_episodes['scr_test'].map(result_tf_output)
df_episodes['seq_test']=df_episodes['seq_test'].map(result_tf_output)

# Output file
df_episodes.to_csv(output_filename_episodes+'.csv', index=False)    

## Manual review

In [None]:
#tracker for unique records (some records may be marked as new if clean function changes)

#initialize tracker
try:
    f = open('record_tracker.pkl')
    f.close()
except FileNotFoundError:
    df_tracker = pd.DataFrame(columns=['filename', 'reportinglaborgname', 'testrequestcode', 'observationcode', 'cleaned_value'])
    df_tracker.to_pickle("./record_tracker.pkl")
    print('CREATING RECORD TRACKER FILE')
    
#read tracker
df_tracker = pd.read_pickle('./record_tracker.pkl')

#RESET TRACKER
#df_tracker = df_tracker.iloc[0:0]

df_tracker_orig = df_tracker[['reportinglaborgname', 'testrequestcode', 'observationcode', 'cleaned_value']].copy(deep = True)
df_tracker_delta = df_unique[['reportinglaborgname', 'testrequestcode', 'observationcode', 'cleaned_value']].copy(deep = True)

#set difference
df_tracker_delta = pd.concat([df_tracker_delta, df_tracker_orig, df_tracker_orig], ignore_index=True).drop_duplicates(keep=False)
print('Original tracker length:', len(df_tracker_orig))
print('Delta tracker length:', len(df_tracker_delta))
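
The set difference above relies on concatenating the tracker twice before dropping every duplicated row. The counting argument behind that can be sketched in plain Python, assuming the current records are already unique. The single-letter records are placeholders.

```python
from collections import Counter

current = ['a', 'b', 'c']  # unique records in this run (assumed already de-duplicated)
tracked = ['b', 'c', 'd']  # records reviewed previously

# tracked records appear at least twice, so only records unique to `current` have a count of 1
counts = Counter(current + tracked + tracked)
delta = [rec for rec in current if counts[rec] == 1]

print(delta)
```

Adding the tracker only once would let records that exist only in the tracker survive with a count of 1, which is why it is included twice.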

In [None]:
#intermediate output for checking results
int_output_cols = ['count', 'reportinglaborgname', 'testrequestcode', 'observationcode', 'cleaned_value']
df_unique[int_output_cols].join(df_results.drop(columns=['original_indexes'])).to_csv('intermediate_output.csv')
df_unique[int_output_cols][df_unique.index.isin(df_tracker_delta.index)].join(df_results.drop(columns=['original_indexes'])).to_csv('intermediate_output_delta.csv')

In [None]:
#FINALIZE THE RECORD TRACKER (only run when you are satisfied with the review process)
#add filename
df_tracker_delta['filename'] = input_filename

#add the delta
df_tracker = pd.concat([df_tracker, df_tracker_delta], sort=False, ignore_index=True)

#save file
df_tracker.to_pickle("./record_tracker.pkl")
print('Records in tracker:', len(df_tracker))

#cleanup
del df
del df_unique
del df_output
del df_tracker
del df_tracker_orig
del df_tracker_delta

In [None]:
# # % positivity numbers

# # patient level - SCREENING
# df_pat = df_episodes[['patientid','scr_voc','scr_test']]
# df_pat = df_pat.groupby(['patientid','scr_test'],as_index=False).agg({'scr_voc':'min'})

# print("PATIENT LEVEL SCREENING")
# print(sum(df_pat.scr_voc==1))
# print(sum(df_pat.scr_test==1))
# print(sum(df_pat.scr_voc==1)/sum(df_pat.scr_test==1))

# # patient level - SCREENING or SEQUENCING
# df_pat = df_episodes[['patientid','scr_voc','scr_test','seq_voc','seq_test']]
# df_pat = df_pat.groupby(['patientid','scr_test','seq_test'],as_index=False).agg({'scr_voc':'min','seq_voc':'min'})

# print("PATIENT LEVEL SCREENING OR SEQUENCING")
# print(sum((df_pat.scr_voc==1)|(df_pat.seq_voc==1)))
# print(sum((df_pat.scr_test==1)|(df_pat.seq_test==1)))
# print(sum((df_pat.scr_voc==1)|(df_pat.seq_voc==1))/sum((df_pat.scr_test==1)|(df_pat.seq_test==1)))

# # episodes level - SCREENING
# print("EPISODE LEVEL SCREENING")
# print(sum(df_episodes.scr_voc==1))
# print(sum(df_episodes.scr_test==1))
# print(sum(df_episodes.scr_voc==1)/sum(df_episodes.scr_test==1))

# # episodes level - SCREENING or SEQUENCING
# print("EPISODE LEVEL SCREENING OR SEQUENCING")
# print(sum((df_episodes.scr_voc==1)|(df_episodes.seq_voc==1)))
# print(sum((df_episodes.scr_test==1)|(df_episodes.seq_test==1)))
# print(sum((df_episodes.scr_voc==1)|(df_episodes.seq_voc==1))/sum((df_episodes.scr_test==1)|(df_episodes.seq_test==1)))

## Testing and validation

In [None]:
#test a string
test_string = r'''

'''

test_clean = clean(test_string)
print('----', test_clean)
test_useful = assign_labels(tokenize(test_clean))
print('----', test_useful)
test_interpret = interpret(test_useful)
print('----', test_interpret)