# COVID-19 Lab Results
Written by: Branson Chen, Danson Lou, Gulam Mohammed, Vicky Zhang, Harrison Zhang <br>
Last modified: 20250508

## Notes:
#### 1. Please make sure your input file is sorted by "ordersid".  
#### 2. In "Function Definition", when importing the input file, please make sure you have selected the correct datetime format.
#### 3. Please go to "Parameter Definition" to define all of the required parameters. After doing so, you can run the whole script.
#### 4. The code also generates two more CSV files: intermediate_output and intermediate_output_delta. You can comment them out if you don't need them.
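
Notes 1 and 2 can be checked before a long run. The snippet below is a minimal sketch, not part of the notebook: `check_input` is a hypothetical helper, the sample rows are invented, and it uses only the standard library. It streams a CSV, confirms that `ordersid` never decreases, and parses the first `observationdatetime` with the format string used in the CSV branch of `processBatch`.

```python
import csv
import io
from datetime import datetime

def check_input(handle, datetime_format='%d%b%y:%H:%M:%S'):
    """Hypothetical pre-flight check: return (is_sorted, first_parsed_datetime).

    Reads the CSV row by row, so the whole file is never held in memory.
    """
    previous = None
    first_parsed = None
    for row in csv.DictReader(handle):
        current = int(row['ordersid'])
        if previous is not None and current < previous:
            return False, first_parsed  # ordersid went backwards: file is not sorted
        previous = current
        if first_parsed is None:
            # Raises ValueError if the format string does not match the file.
            first_parsed = datetime.strptime(row['observationdatetime'], datetime_format)
    return True, first_parsed

# Invented sample data; a real run would pass open(input_path + input_filename, newline='').
sample = io.StringIO(
    "ordersid,observationdatetime\n"
    "101,05MAR20:14:30:00\n"
    "101,05MAR20:14:45:00\n"
    "102,06MAR20:09:00:00\n"
)
is_sorted, first_dt = check_input(sample)
print(is_sorted, first_dt)
```

If the datetime column uses another layout (for example the `'%Y-%m-%d %H:%M:%S.%f0'` variant that is commented out in `processBatch`), pass that format string instead.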

## Table of Contents

<a href='#Overview'>Overview</a><br>
<a href='#Import-library'>Import library</a><br>
<a href='#Function-definition'>Function definition</a><br>
<a href='#Process-Data'>Process data</a><br>

- <a href='#Parameter-Definition'>Parameter definition</a><br>
- <a href='#Process-Data-in-Batches'>Process Data in Batches</a><br>
- <a href='#Combine-Batch-Outputs-to-the-Final-Output'>Combine Batch Outputs to the Final Output</a><br>
- <a href='#Remove-Batch-Outputs'>Remove Batch Outputs</a><br>

<a href='#Algorithm-description'>Algorithm description</a><br>


## Overview

- This script first imports a SAS/CSV file based on the input variables provided, and then fields are decoded/renamed.
- Next, the text is cleaned (clean function) and then tokenized (tokenize function).
- Relevant labels are then assigned to the tokens (assign_labels function).
- The labelled tokens are then interpreted using an in-house algorithm (interpret function).
- All of the information from the previous step is then collapsed to give one result per virus per test (process_result function), and unidentified virus/test types are filled in based on observation codes and testrequest codes.
- Lastly, the results are converted to a single character per virus type (char_output function) and then output in a csv.
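
The steps above can be illustrated on a single result string. This is a heavily simplified sketch, not the notebook's algorithm: the `*_sketch` functions, the two small dictionaries, and the "skip a positive right after a negative" rule are assumptions made for illustration, and a whitespace split stands in for `nltk.word_tokenize`.

```python
import re

def clean_sketch(value):
    """Lowercase, replace punctuation with spaces, and collapse repeated spaces."""
    cleaned = re.sub(r'[^a-z0-9 ]', ' ', value.lower())
    return re.sub(r' +', ' ', cleaned).strip()

def tokenize_sketch(value):
    # Stand-in for nltk.word_tokenize; a whitespace split is enough for cleaned text.
    return value.split()

# Tiny illustrative dictionaries (assumed subsets, not the notebook's full lists).
VIRUS_PATTERNS = {'v_covid': ['cov', 'sars'], 'v_rsv': ['rsv']}
RESULT_TOKENS = {'r_pos': ['detected', 'positive'], 'r_neg': ['not', 'negative']}

def assign_labels_sketch(tokens):
    """Return one label (or None) per token: substring match for viruses, exact match for results."""
    labels = []
    for token in tokens:
        label = None
        for name, patterns in VIRUS_PATTERNS.items():
            if any(p in token for p in patterns):
                label = name
        for name, words in RESULT_TOKENS.items():
            if token in words:
                label = name
        labels.append(label)
    return labels

def interpret_sketch(labels):
    """Attach each result label to the most recent virus label."""
    results = {}
    virus = None
    previous = None
    for label in labels:
        if label is None:
            continue
        if label.startswith('v_'):
            virus = label
        elif virus is not None:
            # "not detected": the r_pos that directly follows r_neg must not override it.
            if not (label == 'r_pos' and previous == 'r_neg'):
                results[virus] = label
        previous = label
    return results

text = 'SARS-CoV-2: Not Detected. RSV: DETECTED'
tokens = tokenize_sketch(clean_sketch(text))
labels = assign_labels_sketch(tokens)
results = interpret_sketch(labels)
print(tokens)
print(results)  # {'v_covid': 'r_neg', 'v_rsv': 'r_pos'}
```

The real functions handle many more cases (typos, look-ahead windows for subtypes, cancellation and rejection terms), so this sketch only conveys the overall data flow.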

## Import library

In [None]:
import pandas as pd
import numpy as np
import pyreadstat
import nltk
import re
import datetime
import time
import os

##IntegrationPoint01_Open
##IntegrationPoint01_Close

## Function definition
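
Inside `processBatch`, the nested `find_current_cutpoint` helper decides where each batch ends: it starts at the next multiple of the batch size and moves the boundary forward until `ordersid` changes, so rows of one order are never split across batches. The sketch below mirrors that logic on an in-memory list instead of a file. `find_cutpoint_sketch` and the sample ids are hypothetical and only meant to make the boundary rule easy to trace.

```python
def find_cutpoint_sketch(ordersids, last_cut_index, batch_size):
    """Return the exclusive end index of the next batch.

    ordersids must be sorted so that rows of the same order are adjacent.
    """
    # Nominal boundary: the next multiple of batch_size after the previous cut.
    cut = (last_cut_index // batch_size + 1) * batch_size
    if cut - 1 >= len(ordersids):
        return cut  # nominal boundary is past the end of the data
    last_id = ordersids[cut - 1]
    # Extend the batch while the following rows still belong to the same order.
    while cut < len(ordersids) and ordersids[cut] == last_id:
        cut += 1
    return cut

# Invented example: seven rows, three orders, nominal batch size of 3.
ids = [1, 1, 2, 2, 2, 3, 3]
first_cut = find_cutpoint_sketch(ids, 0, 3)           # 5: order 2 is kept whole
second_cut = find_cutpoint_sketch(ids, first_cut, 3)  # 7: end of data
print(ids[0:first_cut], ids[first_cut:second_cut])
```

This is also why the input must be sorted by `ordersid`: the boundary search only looks at adjacent rows.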

In [None]:
def processBatch(batch, last_batch_cut_index, batchSize, input_path, input_filename, input_patientid_var, output_filename, output_flag = 1):
    
    def find_current_cutpoint():

        def get_ordersid_at_index(index): 
            # Return the ordersid of the (index+1)th data row, i.e. the row at 0-based position `index`.
            # If the index is at or beyond the end of the dataset, return False.
            if input_filename.endswith('.csv'):
                ##IntegrationPoint06_Open
                row_df = pd.read_csv(input_path+input_filename,
                        skiprows=index, nrows=1, header=0,
                        names=pd.read_csv(input_path+input_filename, nrows=1).columns.values)
                ##IntegrationPoint06_Close
            elif input_filename.endswith('.sas7bdat'):
                row_df, _ = pyreadstat.read_sas7bdat(filename_path=input_path+input_filename, 
                                     row_limit=1, row_offset=index)
            else:
                raise Exception("file is not csv or sas7bdat")
            return row_df.iloc[0]['ordersid'] if len(row_df) == 1 else False
        
        current_file_cutpoint = (int(last_batch_cut_index/batchSize) + 1) * batchSize
        current_last_ordersid = get_ordersid_at_index(current_file_cutpoint - 1)
        if not current_last_ordersid: # already reached end of file
            return current_file_cutpoint
        while True:
            current_file_cutpoint += 1
            current_cutpoint_ordersid = get_ordersid_at_index(current_file_cutpoint - 1)
            if not current_cutpoint_ordersid: # reached end of dataset
                return current_file_cutpoint - 1 # reached end of file
            elif current_cutpoint_ordersid != current_last_ordersid:
                return current_file_cutpoint - 1 # found the row with different ordersid
            
    #default: use find_current_cutpoint so that no ordersid is split across batches
    current_batch_cut_index = find_current_cutpoint()
    
    #if you want to implement your own solution and don't need the find_current_cutpoint function:
#    current_batch_cut_index = (batch+1) * batchSize
    
    ##IntegrationPoint02_Open
    #load data for csv file
    if input_filename.endswith('.csv'): 
        df = pd.read_csv(input_path+input_filename,
                        skiprows=last_batch_cut_index, nrows=current_batch_cut_index-last_batch_cut_index, header=0,
                        names=pd.read_csv(input_path+input_filename, nrows=1).columns.values, 
                        dtype={input_patientid_var: 'string',
                               'observationsubid': 'string',
                               'fillerordernumber': 'object',
                               'observationcode': 'object',
                               'observationdatetime': 'object',
                               'observationreleasets': 'object',
                               'observationresultstatus': 'object',
                               'observationvalue': 'object',
                               'ordersid': 'int64',
                               'performinglaborgname': 'object',
                               'reportinglaborgname': 'object',
                               'testrequestcode': 'object'})
        
#         df['observationdatetime']=pd.to_datetime(df['observationdatetime'], format='%Y-%m-%d %H:%M:%S.%f0')
#         df['observationreleasets']=pd.to_datetime(df['observationreleasets'], format='%Y-%m-%d %H:%M:%S.%f0')        
        df['observationdatetime']=pd.to_datetime(df['observationdatetime'], format='%d%b%y:%H:%M:%S')
        df['observationreleasets']=pd.to_datetime(df['observationreleasets'], format='%d%b%y:%H:%M:%S')
        
    #load data for sas file
    elif input_filename.endswith('.sas7bdat'):
        df, _ = pyreadstat.read_sas7bdat(filename_path=input_path+input_filename, 
                                     row_limit=current_batch_cut_index-last_batch_cut_index,
                                     row_offset=last_batch_cut_index)

    ##IntegrationPoint02_Close

    if len(df) < 1:
        return -1
    print(f'---------------------- starting batch {batch} ----------------------')
    
    if input_filename.endswith('.sas7bdat'): 
        df['observationdatetime'] = pd.to_timedelta(df['observationdatetime'], unit='s') + datetime.datetime(1960,1,1)
        df['observationreleasets'] = pd.to_timedelta(df['observationreleasets'], unit='s') + datetime.datetime(1960,1,1)

    #decode strings (np objects)
    #df1.loc[:, df1.dtypes == np.object] = df1.loc[:, df1.dtypes == np.object].apply(lambda x: x.str.decode('UTF-8'))


    df.fillna('', inplace=True)
    print('# of records:',len(df))

    if output_flag != 1:
        df_raw = df.copy(deep = True)

    #rename variables
    df = df.rename(columns={input_patientid_var:'patientid','fillerordernumber':'fillerordernumberid',
                           'observationvalue':'value','observationsubid':'subid'})
    #keep key cols
    key_cols = ['patientid', 'ordersid', 'fillerordernumberid', 
                'reportinglaborgname', 'performinglaborgname', 'observationdatetime', 
                'testrequestcode', 'observationcode', 'observationreleasets', 
                'observationresultstatus', 'subid', 'value']
    df = df[key_cols]

    #set exclude_flag based on observationresultstatus = W
    df_W = df.loc[df['observationresultstatus'] == 'W', ['ordersid', 'observationcode', 'value']]
    df_excl = df[['ordersid', 'observationcode', 'value']].reset_index().merge(df_W, how='inner').set_index('index')
    df['exclude_flag'] = 'N'
    df.loc[df.index.isin(df_excl.index),['exclude_flag']] = 'Y'
    print(df['exclude_flag'].value_counts())

    #set exclude_flag based on DO NOT TRANSMIT code
    # DNT_text = '<p1:MicroOrganism xmlns:p1="http://www.ssha.ca"><p1:Code>99999999999</p1:Code><p1:Text>Do Not Transmit</p1:Text><p1:CodingSystem>HL79905</p1:CodingSystem></p1:MicroOrganism>'
    # df_DNT = df.loc[df['value'] == DNT_text, ['ordersid', 'observationcode','observationreleasets']]
    # df_excl2 = df[['ordersid', 'observationcode', 'observationreleasets']].reset_index().merge(df_DNT, how='inner').set_index('index')
    # df.loc[df.index.isin(df_excl2.index),['exclude_flag']] = 'Y'
    # print(df['exclude_flag'].value_counts())

    #%%time
    #determine which observations need to be concatenated
    group_cols = ['ordersid', 'fillerordernumberid', 'reportinglaborgname', 
                  'testrequestcode', 'observationcode', 'observationreleasets', 'observationresultstatus']
    df_gp_subid = df.reset_index().groupby(group_cols).agg({'index':tuple, 'subid':tuple}).reset_index()
    df_gp_subid = df_gp_subid.rename(columns={'index':'original_indexes'})

    #only concatenate groups that have more than two subids, where all subids are numeric and one of them is '1'
    df_to_concat = df_gp_subid[df_gp_subid['subid'].apply(lambda x: all([subid.isdigit() for subid in x]) and len(x) > 2 and '1' in x)]
    concat_indexes = [i for tup in df_to_concat['original_indexes'] for i in tup]

    #concatenate based on subid
    df_gp_concat = df[df.index.isin(concat_indexes)].reset_index()
    df_gp_concat['subid'] = df_gp_concat['subid'].apply(int)
    df_gp_concat = df_gp_concat.sort_values(by = group_cols+['subid']).groupby(group_cols)
    df_gp_concat = df_gp_concat.agg({'index': tuple,
                       'value': lambda x: ' '.join(map(str, x))}).reset_index()

    #add on records that were not concatenated
    df_gp = df.loc[~df.index.isin(concat_indexes), group_cols+['value']].reset_index()
    df_gp['index'] = df_gp['index'].apply(lambda x: (x,))
    df_gp = pd.concat([df_gp_concat, df_gp], sort=False).rename(columns={'index':'original_indexes'})

    #narrow down columns of df
    df_cols = ['patientid','ordersid','fillerordernumberid','observationdatetime','testrequestcode',
               'observationcode','observationreleasets', 'observationresultstatus','exclude_flag']
    df = df[df_cols]

    print('# of TEST RESULTS:', len(df_gp))

    #cleanup
    del df_W
    del df_excl
    del df_gp_subid
    del df_to_concat
    del concat_indexes
    del df_gp_concat

    #clean punctuation, xml field, numbers, other text
    puncs = [';', ':', ',', '.', '-', '_', '/', '(', ')', '[', ']', '{', '}', '<', '>', '*', '#', '?', '.', '+', 
            'br\\', '\\br', '\\e\\', '\\f\\', '\\t\\', '\\r\\', '\\', "'", '"', '=']
    terms_to_space = ['detected', 'by', 'positive', 'parainfluenza', 'accession']
    nums_following = ['date', 'telephone', 'tel', 'phone', 'received', 'collected',  
                     'result', 'on', 'at', '@', 'approved', 'final', 'time', 'number']
    strings_to_replace = {'non detected':'not detected','nt detected':'not detected' ,'npot detected':'not detected', 
                          'nor detected':'not detected', 'mot detected':'not detected', 
                          'n0t detected':'not detected', 'nit detected':'not detected','agenot detected':'not detected',
                          'covid 19 virus not interpretation detected':'covid 19 virus interpretation not detected',
                          'presumptive interpretation':'interpretation presumptive',
                          'preliminary interpretation':'interpretation preliminary',
                          'covid 19 not detected and covid 19 detected':'covid 19 detected and covid 19 not detected',
                          'virusnot':'virus not', 'prevuous':'previous'}
    date_id_patterns = [r'\d{2,4} \d{2} \d{2,4} ', r'\d{4} \d{2} ', r'\d{4}h ', 
                       r' \d{0,2}[a-z]{0,2}\d{5,}[a-z]{0,1}', r' [a-z]{0,2}\d{1,3}[a-z]{1,3}\d{4,}[a-z]{0,1}',
                       r' \d{2}[a-z]{1}\d{3}[a-z]{2}\d{4}', r' [a-z]{4,}\d{7,}']

    def clean(value):
        cleaned = value.lower()

        #clean xml field, only keep text field surrounded with 'p1 text'
        pattern = r'(<p1:microorganism xmlns)(.+)(<p1:text>.+</p1:text>)(.+)(</p1:microorganism>)'
        while re.search(pattern, cleaned):
            cleaned = re.sub(pattern, r'\g<3>', cleaned)

        #surround terms with spaces (some terms found stuck together)
        for t in terms_to_space:
            cleaned = cleaned.replace(t, ' ' + t + ' ')

        #replace punctuation with space
        for punc in puncs:
            cleaned = cleaned.replace(punc, ' ')

        #remove consecutive spaces
        while '  ' in cleaned:
            cleaned = cleaned.replace('  ', ' ')

        cleaned = cleaned.strip()     

        #remove numbers after certain terms
        for term in nums_following:
            pattern = term + r' \d{1,4}'

            while re.search(pattern, cleaned):
                cleaned = re.sub(pattern, term, cleaned)

        #remove more dates and ids
        for pattern in date_id_patterns:
            while re.search(pattern, cleaned):
                cleaned = re.sub(pattern, '', cleaned)

        #remove numbers at the end
        while len(cleaned) > 0 and (cleaned[-1].isdigit() or cleaned[-1] == ' '):
            cleaned = cleaned[:-1]

        #remove "no" at the end
        while cleaned.endswith(' no') or cleaned == 'no':
            cleaned = cleaned[:-3]

        #fix certain strings
        for k, v in strings_to_replace.items():
            cleaned = cleaned.replace(k, v)

        return cleaned

    #tokenize values using nltk
    def tokenize(value):
        tokenized = nltk.word_tokenize(value)

        return tokenized

    #assign labels for useful tokens based on some dictionaries and exclusions
    easy_virus_dict = {'v_adenovirus':['aden'], 'v_bocavirus':['boca', 'bocca'], 'v_coronavirus':['coro', 'cora'],
                       'v_entero_rhino':['enterol', 'enterov', 'entervir', 'rhino', 'rhini'], 'v_hmv':['metap']}
    hard_virus_dict = {'v_rsv':['rsv'], 'v_flu':['nflu', 'flue','flua','flub'], 'v_para':['parai', 'pata', 'parta'],
                       'v_covid':['cov', 'sars', 'orf1', 'orfl', 'or1lab']} #fluvid 
    indirect_matches_dict = {'r_pos': ['posi','pos1','covpos'], 
                             'r_neg': ['neg', 'naeg', 'neag'],  
                             'r_ind': ['indeter', 'eterminate', 'inconclu', 'inderter',
                                       'equivocal', 'unresolved'],
                             'r_can': ['cancel', 'incorrect', 'duplicate', 'mislabel','unlabel',
                                       'recollect', 'mistaken', 'wrong', 'redirect'],
                             'r_rej': ['reject', 'inval', 'leak', 'unable', 'insuffic', 
                                       'spill', 'inapprop', 'nsq', 'poor', 'uninterpret'],
                             'presumptive': ['presump', 'prelim', 'possi'], 
                             'retest': ['retest']} #'sent', 'send', 'forwarded'
    direct_matches_dict = {'r_pos': ['detected', 'pos', 'deteced', 'postive', 'organism','isolated'],
                           'r_neg': ['no', 'not'],
                           'r_ind': ['ind'],
                           'r_pen': ['pending', 'progress', 'follow', 'ordered', 'reordered','reorder'],
                           'presumptive': ['single', 'possible', 'probable'],
                           'xml': ['p1'], 
                           'reset': ['deleted','anesthesiologist'],
                           'stop': ['specific', 'required', 'error', 'copy', 'see', 'laboratory',
                                    'note', 'stability', 'changed', 'recollect', 'moh', 'if', 'before'],
                           'final': ['interpretation', 'interpetation', 'interp', 'pretation', 'interpretive',
                                     'final', 'overall', 'corrected', 'proved', 'correct','current'],
                           'skip': ['reason', 'identify', 'confirmation'],
                           'end': ['mutation','voc','vocs','variant','variants','serology'],
                           'connecting': ['screen', 'presence', 'as', 'real',
                                          'is', 'of', 'in', '1', '2', '3', '4', 'a', 'b', 'c',
                                          '229e', 'nl63', 'hku1', 'oc43', '19', '2019', 'low',
                                          'biosafety', 'hazard', 'has', 'been', 'for', 'changed', 'identified', 
                                          'result', 'other', 'testing', 'using', 'to', 'from', 'tested',
                                          'phl', 'phol', 'phlo', 'new', 'request', 'lab', 'will',
                                          'panel', 'seasonal', 'human', 'report', 'said', 'updated', 'dob',
                                          'requisition', 'form','label'
                                          ]}
    test_type_dict = {'t_oth': ['eia', 'rapid', 'immunoassay', 'ict', 'immunochromatographic', 'antigen'], 
                      't_pcr': ['multiplex', 'naat', 'nat', 'pcr', 'rrt', 'rna', 'gen', 
                                'reverse', 'polymerase', 'chain', 'simplexa'],
                      't_gene': ['gene', 'targets', 'tagets', 'target']}

    def assign_labels(tokenized):
        tokenized_length = len(tokenized)
        useful = [None]*tokenized_length #store same list length of tokens and update each accordingly

        for counter, token in enumerate(tokenized):

            #skip if already assigned
            if useful[counter]:
                continue

            ###easy viruses dictionary (non-exact matching)
            for virus, patterns in easy_virus_dict.items():
                if any([pattern in token for pattern in patterns]):
                    useful[counter] = virus
                    break

            #extra rhino/entero rule (exact matching)
            if token in ('rhino', 'entero'):
                useful[counter] = 'v_entero_rhino'

            ###hard viruses dictionary (non-exact matching)

            #COVID19 
            #e/envelope/n/nucleocapsid/s/spike gene
            elif token in ('e', 'envelope', 'n', 'nucleocapsid', 's', 'spike') and (tokenized_length > counter+1)\
            and tokenized[counter+1] == 'gene':
                useful[counter:counter+2] = ['v_covid', 't_gene']

            #rdrp gene
            elif token == 'rdrp' and (tokenized_length > counter+1)\
            and tokenized[counter+1] == 'gene' and 'v_coronavirus' not in useful[:counter]:
                useful[counter:counter+2] = ['v_covid', 't_gene']

            #orf1ab
            elif any([pattern in token for pattern in ['orf1','orfl','or1lab']]) and (tokenized_length > counter+1)\
            and tokenized[counter+1] == 'gene' and 'mers' not in tokenized[counter-3:counter]:
                useful[counter:counter+2] = ['v_covid', 't_gene']

            elif any([pattern in token for pattern in hard_virus_dict['v_covid']])\
            and not any([pattern in token for pattern in ('ecov', 'cove', 'covpos')])\
            and not any([word==pattern for word in tokenized[counter-3:counter] 
                         for pattern in ('mers', 'caesarean', 'xpress')])\
            and not tokenized[counter-1] == 'non':
                useful[counter] = 'v_covid'

            #extra rule for seasonal coronavirus, if preceded by novel or followed by 19/disease/cov/sars/2
            elif any([pattern in token for pattern in easy_virus_dict['v_coronavirus']]):
                if 'nove' in tokenized[counter-1] or tokenized[counter-1] == 'nivel':
                    useful[counter-1:counter+1] = ['connecting', 'v_covid']

                covid_extra = [] #extra terms
                look_forward = 3 #how many terms to look forward for
                max_forward = min(counter+look_forward, tokenized_length-1) #limit if record is too short
                covid_extra = [(tokenized[covid_pos], covid_pos) for covid_pos in range(counter+1, max_forward+1)\
                           if any([pattern in tokenized[covid_pos] for pattern in ('19', 'disea', 'cov', 'sars')]\
                                  +[tokenized[covid_pos] == '2'])]

                #label the range of relevant tokens as covid, unless 'non' appears in the look-ahead window
                if len(covid_extra) > 0 and 'non' not in tokenized[counter+1:max_forward+1]:
                    last_pos = max([x[1] for x in covid_extra])
                    useful[counter:last_pos+1] = ['v_covid']+['connecting']*(last_pos-counter)
                else:
                    pass

            #PARA
            elif any([pattern in token for pattern in hard_virus_dict['v_para']]+[token == 'para'])\
            and tokenized[counter-1] != 'haemophilus':
                para_extra = []
                look_forward = 5
                max_forward = min(counter+look_forward, tokenized_length-1)
                para_extra = [(tokenized[para_pos], para_pos) for para_pos in range(counter+1, max_forward+1)\
                                  if tokenized[para_pos] in ('1','2','3','4')]

                if len(para_extra) > 0:
                    last_pos = max([x[1] for x in para_extra])
                    para_nums = [x[0] for x in para_extra]
                    useful[counter:last_pos+1] = ['v_para_' + '_'.join(para_nums)]+['connecting']*(last_pos-counter)
                else:
                    useful[counter] = 'v_para'

            #FLU
            elif any([pattern in token for pattern in hard_virus_dict['v_flu']]+[token in ('flu', 'inf')])\
            and tokenized[counter-1] != 'haemophilus' and tokenized[counter] != 'fluab':
                flu_extra = []
                look_forward = 4
                max_forward = min(counter+look_forward, tokenized_length-1)

                for flu_pos in range(counter+1, max_forward+1):
                    if tokenized[flu_pos] in ('a','b') or 'h1' in tokenized[flu_pos] or 'h3' in tokenized[flu_pos]:
                        flu_extra.append((tokenized[flu_pos], flu_pos))
                    elif 'flu' in tokenized[flu_pos]: #to deal with influenza a influenza b
                        break

                if len(flu_extra) > 0:
                    last_pos = max([x[1] for x in flu_extra])
                    flu_types = [x[0] for x in flu_extra]
                    if 'a' in flu_types and 'b' in flu_types:
                        useful[counter:last_pos+1] = ['v_flu_a_b']+['connecting']*(last_pos-counter)
                    elif 'b' in flu_types:
                        useful[counter:last_pos+1] = ['v_flu_b']+['connecting']*(last_pos-counter)
                    elif any(['h1' in f for f in flu_types]) and any(['h3' in f for f in flu_types]):
                        useful[counter:last_pos+1] = ['v_flu_a_h1_h3']+['connecting']*(last_pos-counter)
                    elif any(['h1' in f for f in flu_types]):
                        useful[counter:last_pos+1] = ['v_flu_a_h1']+['connecting']*(last_pos-counter)
                    elif any(['h3' in f for f in flu_types]):
                        useful[counter:last_pos+1] = ['v_flu_a_h3']+['connecting']*(last_pos-counter)
                    elif 'a' in flu_types:
                        useful[counter:last_pos+1] = ['v_flu_a']+['connecting']*(last_pos-counter)                                                                  
                elif token.endswith('aa'):
                    useful[counter] = 'v_flu_a'
                elif token.endswith('ab'):
                    useful[counter] = 'v_flu_b'
                else:
                    useful[counter] = 'v_flu'

            #RSV
            elif any([pattern in token for pattern in hard_virus_dict['v_rsv']]):
                rsv_extra = []
                look_forward = 2
                max_forward = min(counter+look_forward, tokenized_length-1) 
                rsv_extra = [(tokenized[rsv_pos], rsv_pos) for rsv_pos in range(counter+1, max_forward+1)\
                           if tokenized[rsv_pos] in ('a','b')]

                if len(rsv_extra) > 0:
                    last_pos = max([x[1] for x in rsv_extra])
                    rsv_types = [x[0] for x in rsv_extra]
                    if 'a' in rsv_types and 'b' in rsv_types:
                        useful[counter:last_pos+1] = ['v_rsv_a_b']+['connecting']*(last_pos-counter)
                    elif 'a' in rsv_types:
                        useful[counter:last_pos+1] = ['v_rsv_a']+['connecting']*(last_pos-counter)
                    elif 'b' in rsv_types:
                        useful[counter:last_pos+1] = ['v_rsv_b']+['connecting']*(last_pos-counter)
                else:
                    useful[counter] = 'v_rsv'

            elif (tokenized_length > counter+2) and ((token.startswith('resp')\
            and tokenized[counter+1].startswith('syn') and tokenized[counter+2].startswith('vi'))\
            or (token == 'r' and tokenized[counter+1] == 's' and tokenized[counter+2] == 'v')):
                rsv_extra = []
                look_forward = 4
                max_forward = min(counter+look_forward, tokenized_length-1) 
                rsv_extra = [(tokenized[rsv_pos], rsv_pos) for rsv_pos in range(counter+3, max_forward+1)\
                           if tokenized[rsv_pos] in ('a','b')]

                if len(rsv_extra) > 0:
                    last_pos = max([x[1] for x in rsv_extra])
                    rsv_types = [x[0] for x in rsv_extra]
                    if 'a' in rsv_types and 'b' in rsv_types:
                        useful[counter:last_pos+1] = ['v_rsv_a_b']+['connecting']*(last_pos-counter)
                    elif 'a' in rsv_types:
                        useful[counter:last_pos+1] = ['v_rsv_a']+['connecting']*(last_pos-counter)
                    elif 'b' in rsv_types:
                        useful[counter:last_pos+1] = ['v_rsv_b']+['connecting']*(last_pos-counter)
                else:
                    useful[counter:counter+3] = ['v_rsv', 'connecting', 'connecting']

            #UNKNOWN VIRUS
            elif (token.startswith('vir') or token.startswith('viu')):
                #extra rule for virus culture
                if (tokenized_length > counter+2) and tokenized[counter+1].startswith('cult')\
                and 'request' in tokenized[counter+2]:
                    useful[counter:counter+3] = ['connecting']*3
                elif (tokenized_length > counter+1) and tokenized[counter+1].startswith('cult'):
                    useful[counter:counter+2] = ['t_oth']*2
                else:
                    useful[counter] = 'v_unk'

            #extra terms to treat as an "unknown virus" for purpose of algorithm
            elif token in ('by','further','specimen','specimens','test','sample','considered'):
                useful[counter] = 'v_unk'

        # loop over the record again
        for counter, token in enumerate(tokenized):

            #skip if already assigned
            if useful[counter]:
                continue

            #culture tests  
            if token.startswith('cult') and not ((tokenized_length > counter+1) and 'request' in tokenized[counter+1]):
                useful[counter] = 't_oth'

            #additional "direct" tests
            elif token == 'direct' and (tokenized_length > counter+1):
                if tokenized[counter+1] in ('kit', 'enzyme', 'test', 'testing', 'eia', 'antigen', 'ict'):
                    useful[counter:counter+2] = ['t_oth']*2
                elif tokenized[counter+1] in ('influenza',):
                    useful[counter] = 't_oth'

            #condition for mention of pos/neg
            elif token in ('negative','neg','positive','pos','detected','organism')\
            and (tokenized_length > counter+1)\
            and ((tokenized[counter-1] in ('a','original','or','level','of','the','tested','was','false','had','antibodies','antibody') 
                  and tokenized[counter+1] in ('test','result','covid','new','at','note','sars')) 
                 or tokenized[counter+1] in ('or','swab','to','contact','workers','retest','results',
                                             'son','person','patients','travel','individual','admission',
                                             'roommate','rapid','coworker','co','cultures')):
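                start = max(counter-1, 0) #clamp: a negative slice start would change the length of useful
                useful[start:counter+2] = [None]*(counter+2-start)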
            elif token in ('negative','neg','positive','pos','detected','organism','posivtive')\
            and (tokenized_length > 1)\
            and (tokenized[counter-2] in ('previous','previously','contact','worker','depot','targets',
                                          'being','unless','patient','law','due','exposure','needs','if',
                                          'swab','who','psw','known','must','mother')
                 or tokenized[counter-1] in ('previous','previously','known','unit','first','second',
                                             'needs','need','requires','considered','swab','if',
                                             'depot','employee','gram','cx','member','coworker','shows',
                                             'father','contact','both','and','confirm','despite','provide',
                                             'mom','verify','particles')):
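                start = max(counter-1, 0) #clamp: a negative slice start would change the length of useful
                useful[start:counter+1] = [None]*(counter+1-start)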
            elif token in ('negative','neg','positive','pos','detected','organism')\
            and (tokenized_length > 3)\
            and (tokenized[counter-3] in ('mom','him','father','which','sister','partner','contact','who')
                or tokenized[counter-4] in ('who',)):
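                start = max(counter-2, 0) #clamp: a negative slice start would change the length of useful
                useful[start:counter+1] = [None]*(counter+1-start)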

            #condition for word before no
            elif token == 'no' and (tokenized[counter-1] in ('by','lab','specimen','accession',
                                                             'sample','order','please','phl',
                                                             'with','bipap','pain','ord')
                                    or any([pattern in tokenized[counter-1] for pattern in ('out','break','inv')]))\
            and tokenized[counter+1:counter+2] != ['virus'] and counter > 0:
                useful[counter-1:counter+1] = [None]*2

            #condition for word after no (skip)
            elif token == 'no' and (tokenized_length > counter+1)\
            and tokenized[counter+1] in ('fever','answer','longer','patient','second','grh','afb',
                                         'exposure','appetite','outbreak','symptoms','symmptoms',
                                         'show','2nd','transfer','requisition','evidence','dob',
                                         'response','name','unexplained','lts','hc','hcn','paperwork',
                                         'further','pick','time','call','other','viral', 'need', 'salmonella', 
                                         'date', 'submission', 'health','subsitute','cough','two','identifying','match',
                                         'growth','birthday'):
                useful[counter] = 'skip'

            #condition for word after no (cancel)
            elif token == 'no' and (tokenized_length > counter+1)\
            and tokenized[counter+1] in ('specimen','reportable','done','gene','result',
                                         'media','liquid','sample','swab','nasopharyngeal','record','fluid',
                                         'patient','second','results','testing','eluate','option','chose',
                                         'speicmen','label','validated','culture','volume','saliva','requisiton',
                                         'naso','confirmatory','dry'): 
                useful[counter] = 'r_can'

            #condition for due to
            elif tokenized[counter:counter+2] in [['due','to'],['for','result']] and 'new' not in tokenized[counter+2:counter+4]:
                useful[counter:counter+2] = ['stop']*2

            #condition for word after not (skip)
            elif token == 'not' and (tokenized_length > counter+1)\
            and tokenized[counter+1] in ('test','been','suspicious','validated','valildated','the','admitted',
                                         'preclude','intended','likely','given','coming','verified','ha','covid',
                                         'for','signed','on','recommended','retrieve','corssing','swabbed',
                                         'beingberecollected','at','eligible','vaccinated','put', 'leaking', 'per',
                                         'approved','c','nose','only','informed'):
                useful[counter:counter+2] = ['skip']*2


            #condition for word after not (cancel)
            elif token == 'not' and (tokenized_length > counter+1) and \
            tokenized[counter+1] in ('tested','tessted','perform','performed','process','processed', 
                                     'transmit','suitable','done','doen','be','reported','received', 
                                     'match','needed','labelled','available','symptomatic','forwared',
                                     'met','specified','indicated','returned','sufficient',
                                     'valid','required','able','needed','contain','ordered','recieved',
                                     'labeled','a','provided','appropriate','sent','send','remove',
                                     'report','rapid','found','applicable','rec','used','order',
                                     'matched','labled','proccessed','accepted','receivd','completed',
                                     'recollect','preformed','appearing','in','collected','obtained',
                                     'acceptable','capped','requested','enough','an','clearly','midturbinate',
                                     'refrigerated','refrigirated'):
                useful[counter:counter+2] = ['r_can']*2

            #condition for word before not
            #counter > 0 guards against index -1 wrapping around to the last token
            elif token == 'not' and counter > 0\
            and tokenized[counter-1] in ('does','did','please','done','over','swab','but','do','can','or'): #can?
                useful[counter-1:counter+1] = ['skip']*2

            #condition for errors
            elif tokenized[counter:counter+3] in (['ordered','in','error'], ['positive','in','error'],
                                                 ['no','covid','result'], ['pos','in','error'],
                                                 ['sent','under','incorrect'], ['no','covid','order'],
                                                 ['s','already','positive']): 
                useful[counter:counter+3] = ['r_can']*3
            elif tokenized[counter:counter+3] in (['added','in','error'], ['the','wrong','patient'],['reported','in','error']):
                useful[counter:counter+3] = ['reset']*3
            elif tokenized[counter:counter+2] in (['processing','error'], ['in','error'], ['same','test'],
                                                  ['labelling','error']):
                useful[counter:counter+2] = ['r_can']*2
            elif tokenized[counter:counter+4] in (['report','please','disregard','covid'],
                                                  ['patient','please','disregard','results'],
                                                  ['report','please','disregard','report'],
                                                  ['individual', 'target', 'result', 'interpretation'],
                                                  ['individual', 'target', 'results', 'interpretation'],
                                                  ['requesting', 'negative', 'covid' ,'swab'],
                                                  ['detected', 'cycle', 'threshold' ,'35'],
                                                  ['does','not','belong','to'],
                                                  ['outbreak', 'no', 'bdpoc', 'valid'],
                                                  ['father', 'tested', 'positive', 'for'],
                                                  ['multiplex', 'covid', 'flu', 'rspending'],
                                                  ['report', 'positive', 'family','member'],
                                                  ['organism', 'quantitation', 'of', 'growth'],
                                                  ['please', 'disregard', 'covid', '19'],
                                                  ['was', 'chosen', 'in', 'error']):
                useful[counter:counter+4] = ['end']*4
            elif tokenized[counter:counter+4] in (['please','remove','previous','copies'],):
                useful[counter:counter+4] = ['v_covid','r_can','r_can','r_can']

            #condition for target rna and disregard
            elif tokenized[counter:counter+2] in (['target','rna'],['patient','disregard'] ): 
                useful[counter:counter+2] = ['end']*2

            #condition for mother positive 
            elif token == 'mother' and (tokenized_length > counter+1)\
            and tokenized[counter+1] in ('positive',): #one-element tuple; a bare string would do a substring test
                useful[counter:counter+2] = ['skip']*2


            #condition for previous
            elif 'previous' in token and ('reported' in tokenized[counter+1:counter+3] or
                                          'specimen' in tokenized[counter+1:counter+2] or
                                          (tokenized[counter+1:counter+3] == ['report','covid'] and
                                               tokenized[counter-1] == 'the') or
                                          tokenized[counter+1:counter+3] in (['report','of'],
                                                                             ['reports','of'],
                                                                             ['reportof','covid'],
                                                                             ['result','of'],
                                                                             ['covid','19'],
                                                                             ['entered','covid'],
                                                                             ['report','as'],
                                                                             ['result','was'],
                                                                             ['report','that'])):
                useful[counter:counter+2] = ['end']*2

            #unable and indeterminate
            elif tokenized[counter:counter+5] in (['unable','to','be','completed','indeterminate'],
                                                 ['please','disregard','previous','indeterminate','result'],
                                                 ['interpretation','possible','low','viral','load'],
                                                 ['all','positive','and','indeterminate','results'],
                                                 ['19','positive','e','e','ab'],
                                                 ['repeat', 'gx', 'e', 'neg', 'n'],
                                                 ['for', 'clinical', 'use', 'indeterminate', 'for'],
                                                 ['indeterminate', 'cycle', 'threshold','99', 'specimen'],
                                                 ['in', 'low', 'has', 'positive','covid'],
                                                ['rejection','interface','error','specimen','not'],
                                                  ['husband', 'swabbed', 'positive', 'for', 'covid'],
                                                  ['born', 'to', 'covid', 'positive', 'mother'],
                                                  ['newborn', 'to', 'covid', 'positive', 'mother'],
                                                  ['to', 'covid', '19', 'positive','mother'],
                                                  ['roommate', 'was', 'covid', 'positive', 'day'],
                                                  ['delivered', 'to', 'covid', 'positive', 'mother'],
                                                  ['exposed', 'to', 'a', 'positive', 'case'],
                                                  ['exposed', 'to', 'covid', 'positive', 'case'],
                                                  ['need', 'confirmation', 'of', 'negatvie', 'swab'],
                                                  ['icu', 'admission', 'no', 'covid', 'swab'],
                                                  ['positive', 'patient', 'sample', 'submitted', 'for']
                                                ):

                useful[counter:counter+4] = ['connecting']*4

            #upon further investigation
            elif tokenized[counter:counter+3] in (['upon','further','investigation'],):

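                #clamp the slice start so a negative index cannot wrap around and lengthen the list
                useful[max(counter-2, 0):counter] = ['reset']*min(counter, 2)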

            #exclude antibody and reordered
            elif tokenized[counter:counter+8] in (['negative','for','anti','sars','cov','2','igg','antibodies'],
                                                  ['negative','for','anti','sars','cov','2','iga','antibodies'],
                                                 ['positive', 'for', 'anti', 'sars', 'cov' ,'2', 'igg', 'antibodies'],
                                                 ['positive', 'for', 'anti', 'sars', 'cov' ,'2', 'iga', 'antibodies'],
                                                 ['campylobacter', 'yersinia', 'or', 'e', 'coli', '0157', 'h7', 'isolated'],
                                                 ['client', 'disclosed', 'chosen', 'not', 'to', 'vaccinat', 'for', 'covid'],
                                                 ['forwarded', 'to', 'public', 'health', 'for', 'covid', 'flu', 'and'],
                                                 ['label','on','specimen','not','matching','please','re','collect'],
                                                 ['cepheid', 'multiplex', 'covid', 'flu', 'rsfinal', 'test', 'not', 'performed'],
                                                 ['and', 'covidscr', 'are', 'ordered', 'on', 'the', 'same', 'swab'],
                                                 ['neg', 'pos', 'rpt','pos', 'further', 'report', 'to', 'follow'],
                                                 ['disregard', 'previous', 'result', 'stating', 'covid', '19', 'virus', 'detected'],
                                                  ['the', 'university', 'health', 'covid', '19', 'virus', 'not', 'detected'],
                                                  ['detected', 'sars', 'cov', '2', 'n', 'gene', 'low', 'positive'],
                                                  ['to', 'covid', '19', 'positive', 'case', 'daughter', 'visited', 'yesterday'],
                                                  ['of', 'dry', 'coughing', 'and', 'no', 'covid', 'test', 'in'],
                                                  ['admitted', 'to', 'nicu', 'exposed', 'to', 'covid', 'positive', 'mother']
                                                 ):

                useful[counter:counter+8] = [None]*8 #None (not the string 'None') marks the tokens as irrelevant

            #assign pending
            elif tokenized_length==3 and tokenized[counter:counter+3] == ['see','scanned','result']: 
                useful[counter:counter+3] = ['r_pen']*3 

            #exclude accidentally picked-up negatives
            elif 'inneg' in token or 'lineag' in token:
                useful[counter] = 'connecting'

            else:
                #indirect_matches dictionary
                for term, patterns in indirect_matches_dict.items():
                    if any([pattern in token for pattern in patterns]):
                        useful[counter] = term
                        break

                #direct_matches dictionary
                for term, patterns in direct_matches_dict.items():
                    if any([pattern == token for pattern in patterns]):
                        useful[counter] = term
                        break

                #test_type dictionary
                for test, patterns in test_type_dict.items():
                    if any([pattern == token for pattern in patterns]):
                        useful[counter] = test
                        break

        return useful

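    #interpret text to get initial results
    #input: list of labels from assign_labels (v_* virus, r_* result, t_* test, modifier words, or None)
    #output: list of [virus, result, test, final_flag] entries, e.g. ['v_covid', 'r_neg', 't_unk', False]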
    def interpret(useful):

        def save(b):
            #presumptive modifier
            if b[1] == 'r_pos' and modifier[1]:
                b[1] = 'r_pre'

            #end modifier (skips a save)
            if not modifier[2] or modifier[0]:
                output.append([b[0] if b[0] else 'v_unk', 
                               b[1] if b[1] else 'r_neg', 
                               b[2] if b[2] else 't_unk', 
                               modifier[0]]) #final modifier

            b[0] = None
            b[1] = None
            modifier[0:4] = [False, False, False, False]
            return

        sentence = useful[:]
        output = []

        #bundle for current virus/result/test
        #0 = virus, 1 = result, 2 = test
        bundle = [None, None, None]

        #modifiers
        #0 = final, 1 = presumptive, 2 = end, 3 = skip
        modifier = [False, False, False, False]

        none_counter = 0 #counter for hitting consecutive irrelevant words
        virus_counter = 0 #counter for different viruses in same segment

        #xml field processing
        xml_pos = [i for i, x in enumerate(sentence) if x == 'xml']
        num = len(xml_pos)//2
        for i in range(num):
            xml_start_pos = xml_pos[i*2]
            xml_end_pos = xml_pos[i*2+1]
            for j in range(xml_start_pos, xml_end_pos + 1):
                if sentence[j] and sentence[j].startswith('v_') and sentence[j] != 'v_unk':
                    bundle[0] = sentence[j]
                    bundle[1] = 'r_pos'
                    save(bundle)

        #add result to output if result in first 3 words
        if len(sentence) > 3 and not any(['v_' in s for s in sentence[0:3] if s]):
            for s in sentence[0:3]:
                if s and 'r_' in s:
                    output.append(['v_unk', s, 't_unk', False])
                    break
                elif s and s == 'connecting':
                    pass
                else:
                    break

        #if there is mention of retest but no final result flag, assign final flag to start
        if ('retest' in sentence or (len(sentence) > 0 and sentence[0] == 'r_ind')) and 'final' not in sentence :
            modifier[0] = True

        #loop on words in sentence
        for word in sentence:

            if word: #relevant term
                none_counter = 0 #restart counter

                #set current virus 
                if word.startswith('v_'):
                    #different virus
                    if word != 'v_unk' and word != bundle[0]:
                        #save current result if hitting a different virus
                        if bundle[0] and bundle[0] != 'v_unk':
                            #skip modifier
                            if modifier[3]:
                                modifier[3] = False #reset skip modifier
                                bundle[1] = None
                            else:
                                save(bundle)
                                virus_counter += 1 #increase counter if different virus in segment     
                        bundle[0] = word
                    #same virus
                    elif word != 'v_unk' and word == bundle[0]:
                        #save current result if there is one
                        if bundle[1]:
                            save(bundle)
                        bundle[0] = word
                    #only set to general virus if there's no current virus
                    elif word == 'v_unk' and not bundle[0]:
                        bundle[0] = word

                #set current result
                elif word.startswith('r_'):
                    if word == 'r_ind':
                        if bundle[1]: 
                            save(bundle)
                        bundle[1] = word
                    elif word == 'r_neg' and bundle[1] not in ('r_ind',):
                        if bundle[1]: 
                            save(bundle)
                        bundle[1] = word
                    elif word == 'r_pos' and bundle[1] not in ('r_ind', 'r_neg'):
                        bundle[1] = word

                    elif word in ('r_rej', 'r_can', 'r_pen') and bundle[1] not in ('r_ind', 'r_neg', 'r_pos'):
                        if word == 'r_rej':
                            bundle[1] = word
                        elif word == 'r_can' and bundle[1] not in ('r_rej',):
                            bundle[1] = word
                        elif word == 'r_pen' and bundle[1] not in ('r_rej', 'r_can'):
                            bundle[1] = word

                #set current test
                elif word.startswith('t_'):
                    bundle[2] = word

                #final modifier
                elif word == 'final':
                    if bundle[0] and bundle[1]:
                        save(bundle)
                    modifier[0] = True
                    bundle[1] = None #reset result

                #presumptive modifier
                elif word == 'presumptive':
                    modifier[1] = True

                #end modifier/word
                elif word == 'end':
                    #end early if a definitive result (ind/neg/pos) has already been saved
                    if len([o for o in output if o[1] in ('r_ind', 'r_neg', 'r_pos')]) > 0: 
                        return output
                    if bundle[0] and bundle[1]:
                        save(bundle)
                    modifier[0:4] = [False, False, True, False] #end modifier skips next save
                    #otherwise end early if any result at all has been saved
                    if len(output) > 0:
                        return output

                #skip modifier
                elif word == 'skip':
                    if bundle[0] and bundle[1]:
                        save(bundle)
                    modifier[3] = True
                    bundle[0] = None
                    bundle[1] = None

                #stop word
                elif word == 'stop':
                    if bundle[0] and bundle[1]:
                        save(bundle)
                    modifier[0:4] = [False, False, False, False]
                    bundle[0] = None
                    bundle[1] = None           

                #reset word
                elif word == 'reset':
                    modifier[0:4] = [False, False, False, False]
                    bundle[0] = None
                    bundle[1] = None

            else: #word is None
                none_counter += 1

                if none_counter == 2: #can change threshold
                    #save if there is current virus and result
                    if bundle[0] and bundle[1]:
                        save(bundle)
                    #save the last virus if multiple were listed
                    elif bundle[0] and bundle[0] != 'v_unk' and virus_counter > 1:
                        if modifier[3]:
                            modifier[3] = False #reset skip modifier
                        else:
                            save(bundle)
                    #reset
                    none_counter = 0 
                    virus_counter = 0
                    bundle[0] = None
                    bundle[1] = None
                    modifier[0:4] = [False, False, False, False]

        #if there is still a remaining result
        if bundle[1]: 
            save(bundle)

        #if there is an extra virus listed at the end
        elif bundle[0] and bundle[0] != 'v_unk' and virus_counter > 1 and not modifier[3]:
            save(bundle)

        return output

    #using reference excel to assign LOINCs to virus and test type
    #added COVID19 LOINCs
    xlsx_filename = 'COVID19_Resp_codes_20241231.xls'
    mappings = {'--':'unk', 'culture':'cult', 'other':'oth', 'entero_rhino_D68':'entero_rhino'}

    df_loincs = pd.read_excel(xlsx_filename, sheet_name='Resp_LOINCs')
    df_loincs_covid = pd.read_excel(xlsx_filename, sheet_name='COVID19_LOINCs')
    #DataFrame.append was removed in pandas 2.0; pd.concat gives the same stacked result
    df_loincs = pd.concat([df_loincs, df_loincs_covid])

    #cleaning the categories to match previously defined ones
    df_loincs = df_loincs.replace(mappings)
    df_loincs['Virus_to_assign'] = df_loincs['Virus_to_assign'].apply(lambda x: 'coronavirus' if 'corona' in x else x)
    df_loincs['Virus_to_assign'] = df_loincs['Virus_to_assign'].apply(lambda x: 'v_' + x)
    df_loincs['Test_to_assign'] = df_loincs['Test_to_assign'].apply(lambda x: 't_' + x)

    #assign LOINCs to virus and test type
    loincs_by_v = {}
    loincs_by_t = {}
    for index, row in df_loincs.iterrows():
        loincs_by_v.setdefault(row['Virus_to_assign'], [])
        loincs_by_v[row['Virus_to_assign']].append(row['LOINCs'])
        loincs_by_t.setdefault(row['Test_to_assign'], [])
        loincs_by_t[row['Test_to_assign']].append(row['LOINCs'])

    #remove the unk ones
    del loincs_by_v['v_unk']
    del loincs_by_t['t_unk']

    #use reference excel to assign TR codes to virus and test type
    #added COVID19 TR codes
    df_tr_codes = pd.read_excel(xlsx_filename, sheet_name='Resp_TRs')
    df_tr_covid = pd.read_excel(xlsx_filename, sheet_name='COVID19_TRs')
    #DataFrame.append was removed in pandas 2.0; pd.concat gives the same stacked result
    df_tr_codes = pd.concat([df_tr_codes, df_tr_covid])

    #cleaning the categories to match previously defined ones
    df_tr_codes = df_tr_codes.replace(mappings)
    df_tr_codes['Virus_to_assign'] = df_tr_codes['Virus_to_assign'].apply(lambda x: 'coronavirus' if 'corona' in x else x)
    df_tr_codes['Virus_to_assign'] = df_tr_codes['Virus_to_assign'].apply(lambda x: 'v_' + x)
    df_tr_codes['Test_to_assign'] = df_tr_codes['Test_to_assign'].apply(lambda x: 't_' + x)

    #assign TR codes to virus and test type
    tr_codes_by_v = {}
    tr_codes_by_t = {}
    for index, row in df_tr_codes.iterrows():
        tr_codes_by_v.setdefault(row['Virus_to_assign'], [])
        tr_codes_by_v[row['Virus_to_assign']].append(row['TRs'])
        tr_codes_by_t.setdefault(row['Test_to_assign'], [])
        tr_codes_by_t[row['Test_to_assign']].append(row['TRs'])

    #remove the unk ones
    del tr_codes_by_v['v_unk']
    del tr_codes_by_t['t_unk']

    # process_result (below) assigns more details to v_unk or t_unk based on LOINC and TR code,
    # groups by test type and then type of virus, and removes duplicates
    # LOINCs (observation codes) excluded from processing
    loinc_exclusions = ['10219-4','10182-4','11329-0','14869-2','21026-0','22633-2','22634-0','22635-7','22636-5','22637-3',
                        '22638-1','22639-9','3150-0','33882-2','35265-8','41000-1','47526-9','49049-0','55752-0','56816-2',
                        '59465-5','59466-3','611-4','664-3','66746-9','76425-8','94659-0','XON10007-3','XON10011-5',
                        'XON10313-5','XON10315-0','XON10316-8','XON10337-4','XON11913-1','XON12721-7','XON12875-1',
                        'XON13543-4','XON13544-2','XON13545-9', 'XON10318-4','XON10335-8','XON12949-4','94558-4','94661-6',
                        '43305-2','2075-0','2524-7','29308-4','27112-2','633-8','48807-2','45012-2','94508-9','94762-2',
                        'XON13688-7','XON13553-3','608-0','97155-6','24041-6','10466-1','2028-9','60568-3','13316-5',
                        'XON10013-1','4544-3','32623-1','6690-2','718-7','751-8','777-3','785-6','786-4','787-2','788-0',
                        '789-8','18727-8','57006-9','630-4','XON11903-2','14682-9','77202-0','48815-5','42216-2','26881-3',
                        '61377-8','48575-5','90041-5','88636-6','29546-9','65222-2','94652-5','82752-7','91560-3','93848-0',
                        '94651-7','XON12106-1', 'XON13682-0', '32367-5','62424-7','4996-5','8665-2','52121-1','67098-4',
                        '52973-5','XON11882-8','32355-0','59464-8','42957-1','29723-4','10652-6','2705-2','XON12395-0',
                        'XON11914-9','XON13691-1','XON10710-2']
    #TR (test request) codes excluded from processing
    tr_exclusions = ['TR12942-9','TR12952-8','TR12953-6','TR10264-0','TR11823-2','TR10854-8','TR12253-1','TR10487-7',
                     'TR11723-4','TR12704-3','TR11935-4','TR12000-6','TR12613-6','TR11443-9','TR12210-1','TR11943-8',
                     'TR11721-8','TR10807-6','TR11640-0','TR11673-1','TR12029-5','TR11418-1','TR11976-8','TR10749-0',
                     'TR10149-3','TR12259-8','TR11398-5','TR11671-5','TR11882-8','TR10713-6','TR10826-6','TR11763-0',
                     'TR10694-8','TR10686-4','TR11557-6','TR11561-8','TR11987-5','TR11933-9','TR10882-9','TR12584-9',
                     'TR11390-2','TR10745-8','TR11989-1','TR10540-3']

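    #collapse the interpreted results to one result per virus per test type
    #returns a dict {test: [(virus, result), ...]} with the 'v_'/'r_'/'t_' prefixes stripped; empty if excluded
    def process_result(tokens, testrequestcode, observationcode, results):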
        dd = {}
        update_t_pcr_flag = False

        ###extra conditions

        #remove voc 
        if any([t in ['voc','vocs','variant','variants'] for t in tokens[0:7]]):
            return dd

        #LOINC/TR exclusions
        if (observationcode in loinc_exclusions) or (testrequestcode in tr_exclusions):
            return dd

        for i in range(len(tokens)):
            #exclude antibody tests
            if tokens[i:i+3] in (['covid','19','igg'],['cov','2','antibodies'],):
                return dd
            #delete all results for irrelevant phrases
            if tokens[i:i+5] in (['swab', 'is', 'required', 'for', 'both'],
                                 ['is', 'unable', 'to', 'go', 'until'],
                                 ['human','herpes','simplex','virus','type'],
                                 ['registration','error','please','disregard','result'],
                                 ['was','exposed','to','caregiver','with'],
                                 ['corrected','report','specimen','sent','to']):
                return dd
            #make presumptive-positive if test is investigational
            if tokens[i:i+3] in (['not', 'been', 'established'], 
                                 ['is', 'considered', 'investigational'], 
                                 ['a', 'retrospective', 'review']):
                for r in results:
                    if (r[0] == 'v_covid' or r[0] == 'v_unk') and r[1] == 'r_pos':
                        r[1] = 'r_pre' 
            #change negative to pending if there are results to follow
            if tokens[i:i+3] == ['to', 'follow', 'tested']:
                for r in results:
                    if r[1] in ('r_neg','r_can','r_rej') and not r[3]:
                        r[1] = 'r_pen'
            if (tokens[i] == 'naat'
                or tokens[i:i+4] == ['isothermal', 'nucleic', 'acid', 'amplification']
                or tokens[i:i+5] == ['confirmatory', 'pcr', 'testing', 'is', 'required']
                or tokens[i:i+3] == ['pcr', 'not', 'collected']):
                update_t_pcr_flag = True
            if tokens[i:i+2] == ['id', 'now'] and 'detected' in tokens:
                update_t_pcr_flag = True

        #change negative to indeterminate for indeterminate multiplex
        if len(set([v for (v,r,t,f) in results])) > 4:
            if ('indeterminate' in tokens[0:3] or results[0][0:2] == ['v_unk','r_ind']): 
                for r in results:
                    if r[1] == 'r_neg':
                        r[1] = 'r_ind'
            elif ('duplicate' in tokens[0:3]): 
                for r in results:
                    if r[1] == 'r_neg':
                        r[1] = 'r_can'

        #if there is already a covid result, remove v_unk
        if any([(r[0] == 'v_covid' and r[1] in ['r_pos', 'r_pre', 'r_ind', 'r_neg']) for r in results]):
            results = [r for r in results if r[0] != 'v_unk']

        #remove pcr info if it is a rapid test
        if testrequestcode == 'TR12946-0':
            if update_t_pcr_flag:
                for i in range(len(results)):
                    if results[i][2] == 't_pcr':
                        results[i][2] = 't_unk' 
            else:
                results = [r for r in results if r[2] != 't_pcr']

        ###determine virus or test based on LOINC or TR
        v_from_loinc = [loinc_vir for loinc_vir, loincs in loincs_by_v.items() if observationcode in loincs]
        v_from_tr = [tr_codes_vir for tr_codes_vir, tr_codes in tr_codes_by_v.items() if testrequestcode in tr_codes]
        t_from_loinc = [loinc_test for loinc_test, loincs in loincs_by_t.items() if observationcode in loincs]
        t_from_tr = [tr_codes_test for tr_codes_test, tr_codes in tr_codes_by_t.items() if testrequestcode in tr_codes]

        #determine if there are any final/interpretation results
        viruses_with_final = [v for (v,r,t,f) in results if r in ('r_pos', 'r_pre', 'r_ind', 'r_neg', 'r_rej') and f]
        results_final = results
        #remove the non-final/interpretation results for viruses with final/interpretation
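        for vf in viruses_with_final:
            #filter the running list so every virus with a final result is applied, not only the last one
            results_final = [(v,r,t,f) for (v,r,t,f) in results_final if not (v in vf and not f)]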

        for v, r, t, f in results_final:
            #fill in unknown virus
            if v == 'v_unk':
                if len(v_from_loinc) > 0:
                    v = v_from_loinc[0]
                elif len(v_from_tr) > 0:
                    v = v_from_tr[0]

            #fill in unknown test
            if t == 't_unk':
                if len(t_from_loinc) > 0:
                    t = t_from_loinc[0]
                elif len(t_from_tr) > 0:
                    t = t_from_tr[0]

            #fill in pcr if there is a pcr term in text
            if t == 't_unk' and 'pcr' in tokens: 
                t = 't_pcr'

            #remove unknown virus results
            if v != 'v_unk':
                v, r, t = v[2:], r[2:], t[2:]
                #all tests that aren't pcr are oth
                #t = t if t == 'pcr' else 'oth'

                #ASSUME EVERYTHING PCR FOR COVID DATASET
                t = 'pcr'

                dd.setdefault(t, [])

                #compiling results with hierarchy: S (presumptive positive) > P (positive) > I (indeterminate) 
                #                                  > N (negative) > D (pending) > R (invalid) > C (cancelled) 
                same_vir = False
                for i in range(len(dd[t])):
                    if v == dd[t][i][0]:
                        same_vir = True
                        if r == 'pre':
                            dd[t][i] = (v,r)
                        elif r == 'pos' and dd[t][i][1] not in ('pre',):
                            dd[t][i] = (v,r)
                        elif r == 'ind' and dd[t][i][1] not in ('pre', 'pos'):
                            dd[t][i] = (v,r)
                        elif r == 'neg' and dd[t][i][1] not in ('pre', 'pos', 'ind'):
                            dd[t][i] = (v,r)
                        elif r == 'pen' and dd[t][i][1] not in ('pre', 'pos', 'ind', 'neg'):
                            dd[t][i] = (v,r)
                        elif r == 'rej' and dd[t][i][1] not in ('pre', 'pos', 'ind', 'neg', 'pen'):
                            dd[t][i] = (v,r)
                        elif r == 'can':
                            pass
                if not same_vir:
                    dd[t].append((v,r))

        return dd

    #create output as character value for each virus and test type
    result_char = {'pre':'S', 'pos': 'P', 'ind':'I', 'neg':'N', 'pen':'D', 'can':'C', 'rej':'R'}

    def char_output(results, ind):

        #loop through each test type and virus
        for t, pairs in results.items(): #need to update if there are multiple test types
            for v, r in pairs:
                if v in ('adenovirus', 'bocavirus', 'coronavirus', 'entero_rhino', 'hmv', 'covid'):
                    df_results.at[ind, v] = result_char[r]

                elif v.startswith('para'):
                    df_results.at[ind, 'para'] = result_char[r]

                elif v.startswith('flu'):
                    df_results.at[ind, 'flu'] = result_char[r]
                    if '_a' in v:
                        df_results.at[ind, 'flu_a'] = result_char[r]
                    if '_h1' in v:
                        df_results.at[ind, 'flu_a_h1'] = result_char[r]
                    if '_h3' in v:
                        df_results.at[ind, 'flu_a_h3'] = result_char[r]
                    if '_b' in v:
                        df_results.at[ind, 'flu_b'] = result_char[r]

                elif v.startswith('rsv'):
                    df_results.at[ind, 'rsv'] = result_char[r]
                    if '_a' in v:
                        df_results.at[ind, 'rsv_a'] = result_char[r]
                    if '_b' in v:
                        df_results.at[ind, 'rsv_b'] = result_char[r]

        return

    #%%time

    #clean text
    df_gp["cleaned_value"] = df_gp["value"].apply(clean)

    #group by unique records (org, TR code, Obs code, cleaned text) and store original indexes as tuple
    df_gp = df_gp.reset_index()
    groupby_vars = ['reportinglaborgname', 'testrequestcode', 'observationcode', 'cleaned_value']
    df_gp = df_gp.groupby(groupby_vars).agg({'value': 'count', 
                                                     'original_indexes': lambda x: tuple([i for tup in x for i in tup])}).reset_index()
    df_gp = df_gp.rename(columns={'value':'count'})

    df_gp = df_gp.sort_values(by=['count'], ascending=False).reset_index(drop=True)
    print('unique records after cleaning:', len(df_gp))

    #tokenize
    df_gp["cleaned_tokenized_value"] = df_gp["cleaned_value"].apply(tokenize)

    #assign labels using dictionary
    df_gp["useful_tokens"] = df_gp["cleaned_tokenized_value"].apply(assign_labels)

    #interpret the labelled tokens
    df_gp["initial_results"] = df_gp["useful_tokens"].apply(interpret)

    #fill in unknown viruses based on LOINC or TR code, roll up results to one test type
    final_results = []
    for i in range(len(df_gp)):
        final_results.append(process_result(df_gp["cleaned_tokenized_value"][i],
                                            df_gp["testrequestcode"][i], df_gp["observationcode"][i], 
                                            df_gp["initial_results"][i]))

    #translate results to 1-character format
    col_virus = ['covid', 'adenovirus', 'bocavirus', 'coronavirus', 'flu', 'flu_a', 'flu_a_h1', 'flu_a_h3', 'flu_b',
             'entero_rhino', 'hmv', 'para', 'rsv', 'rsv_a', 'rsv_b']

    #create empty df to fill in results
    df_results = pd.DataFrame(index=np.arange(len(df_gp)), columns=['original_indexes']+col_virus)
    df_results['original_indexes'] = df_gp['original_indexes']

    #fill in results
    for i in range(len(df_gp)):
        char_output(final_results[i], i)

    output = [None]*len(df)

    #order results based on original_indexes
    for row in df_results.itertuples():
        for i in row[1]: #original_indexes
            output[i] = tuple(row[2:])

    if output_flag == 1:                
        df_output = pd.concat([df, pd.DataFrame(output, columns=col_virus)], axis=1)

    elif output_flag == 2: 
        df_output = df_raw.join(df[['exclude_flag']].join(pd.DataFrame(output, columns=col_virus)))

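    else:
        #fail early: df_output is never created for any other value of output_flag
        raise ValueError('output_flag must be 1 (key columns) or 2 (ALL columns); set it in the "Parameter Definition" cell')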

    #FINAL DATASET TO OUTPUT
    df_output.to_csv(f'{output_filename}_{batch}.csv', index=False)
    print(df_output['covid'].value_counts())
    #df_output.describe()

    #tracker for unique records (some records may be marked as new if clean function changes)

    #initialize tracker
    try:
        f = open('record_tracker.pkl')
        f.close()
    except FileNotFoundError:
        df_tracker = pd.DataFrame(columns=['filename', 'reportinglaborgname', 'testrequestcode', 'observationcode', 'cleaned_value'])
        df_tracker.to_pickle("./record_tracker.pkl")
        print('CREATING RECORD TRACKER FILE')

    #read tracker
    df_tracker = pd.read_pickle('./record_tracker.pkl')

    #RESET TRACKER
    #df_tracker = df_tracker.iloc[0:0]

    df_tracker_orig = df_tracker[['reportinglaborgname', 'testrequestcode', 'observationcode', 'cleaned_value']].copy(deep = True)
    df_tracker_delta = df_gp[['reportinglaborgname', 'testrequestcode', 'observationcode', 'cleaned_value']].copy(deep = True)

    #set difference
    df_tracker_delta = pd.concat([df_tracker_delta, df_tracker_orig, df_tracker_orig], ignore_index=True).drop_duplicates(keep=False)
    print('Original tracker length:', len(df_tracker_orig))
    print('Delta tracker length:', len(df_tracker_delta))


    #intermediate output for checking results
    int_output_cols = ['count', 'reportinglaborgname', 'testrequestcode', 'observationcode', 'cleaned_value']
    df_gp[int_output_cols].join(df_results.drop(columns=['original_indexes'])).to_csv(f'intermediate_output_{batch}.csv')
    df_gp[int_output_cols][df_gp.index.isin(df_tracker_delta.index)].join(df_results.drop(columns=['original_indexes'])).to_csv(f'intermediate_output_delta_{batch}.csv')

    #Extract new TR codes and LOINCs
    df_code_og_tr = df_tracker['testrequestcode'].drop_duplicates()
    df_code_og_loinc = df_tracker['observationcode'].drop_duplicates()
    df_code_new_tr = df_tracker_delta['testrequestcode'].drop_duplicates()
    df_code_new_loinc = df_tracker_delta['observationcode'].drop_duplicates()
    print(df_code_new_tr[~df_code_new_tr.isin(df_code_og_tr)])
    print(df_code_new_loinc[~df_code_new_loinc.isin(df_code_og_loinc)])

    #FINALIZE THE RECORD TRACKER (only run when you are satisfied with the review process)
    #add filename
    df_tracker_delta['filename'] = input_filename

    #add the delta
    df_tracker = pd.concat([df_tracker, df_tracker_delta], sort=False, ignore_index=True)

    #save file
    df_tracker.to_pickle("./record_tracker.pkl")
    print('Records in tracker:', len(df_tracker))

    #cleanup
    del df
    del df_gp
    del df_output
    del df_tracker
    del df_tracker_orig
    del df_tracker_delta
    
    return current_batch_cut_index
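
The record tracker step above finds new unique records with a set difference built from `pd.concat` and `drop_duplicates(keep=False)`. The sketch below isolates that idiom on toy data. The helper name `set_difference` and the sample frames are illustrative assumptions, not part of the notebook, and the trick only works when the new rows contain no internal duplicates (true for `df_gp`, which is grouped to unique records first).

```python
import pandas as pd

def set_difference(new_rows: pd.DataFrame, known_rows: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of new_rows that do not appear in known_rows (hypothetical helper)."""
    # known_rows is stacked twice so that every known row is guaranteed to be a
    # duplicate; keep=False then drops all copies of any duplicated row.
    stacked = pd.concat([new_rows, known_rows, known_rows], ignore_index=True)
    return stacked.drop_duplicates(keep=False)

# toy data (made up for illustration)
known = pd.DataFrame({'testrequestcode': ['TR1', 'TR2'], 'cleaned_value': ['neg', 'pos']})
new = pd.DataFrame({'testrequestcode': ['TR1', 'TR3'], 'cleaned_value': ['neg', 'pos']})

delta = set_difference(new, known)
print(delta)
```

Because `new_rows` comes first in the concatenation and the index is rebuilt with `ignore_index=True`, the surviving rows keep the positions they had in `new_rows`, which is what lets the notebook filter `df_gp` by `df_tracker_delta.index`.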


# Process Data

## Parameter Definition

In [None]:
# the number of records to be processed in each batch; 30000000 is only an example, so adjust it to fit your memory
batch_size = 30000000

##IntegrationPoint03_Open
# input path and filename (can be .sas7bdat file or csv file)
input_path = '//'

# change the date of input file if needed
input_filename = '.sas7bdat'

# name of patientid variable in input dataset, will be renamed as 'patientid'
input_patientid_var = 'ikn'
##IntegrationPoint03_Close

# output additional columns
#1 = with key columns, 2 = with ALL columns
output_flag = 1

# output filename
output_filename = 'output'

## Validation: check if the input file is sorted by ordersid.
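
As a side note, the pairwise comparison used for this check can also be expressed with pandas' built-in monotonicity properties, which avoid building a Python list. The following is a minimal sketch, assuming the `ordersid` column has already been loaded as a Series; the helper name `is_sorted_by` is hypothetical and is not the function defined in this notebook.

```python
import pandas as pd

def is_sorted_by(values: pd.Series, descending: bool = False) -> bool:
    """Check non-strict ordering of a column (hypothetical helper, ties are allowed)."""
    if descending:
        return bool(values.is_monotonic_decreasing)
    return bool(values.is_monotonic_increasing)

# toy ordersid values (made up for illustration)
ascending_ids = pd.Series([1, 1, 2, 5])
unsorted_ids = pd.Series([3, 1, 2])

print(is_sorted_by(ascending_ids))                   # ascending check
print(is_sorted_by(ascending_ids, descending=True))  # descending check
print(is_sorted_by(unsorted_ids))
```

Both properties are non-strict, so repeated `ordersid` values count as sorted, the same as the `<=` / `>=` comparisons.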

In [None]:
def checkSorted(file_path, file_name, col_name, descending=True):
    if file_name.endswith('.csv'):
        ##IntegrationPoint05_Open
        ordersid_ls = pd.read_csv(file_path+file_name)[col_name].values.tolist()
        ##IntegrationPoint05_Close
    elif file_name.endswith('.sas7bdat'):
        df, _ = pyreadstat.read_sas7bdat(filename_path=file_path+file_name)
        ordersid_ls = df[col_name].values.tolist()
        del df
    else:
        raise Exception("incorrect file format")
    is_sorted = all(a >= b for a, b in zip(ordersid_ls, ordersid_ls[1:])) if descending else all(a <= b for a, b in zip(ordersid_ls, ordersid_ls[1:]))
    del ordersid_ls
    return is_sorted

In [None]:
# the last argument is `descending`: keep False if the input file is sorted by ordersid ascending; delete it (or pass True) if sorted descending
is_sorted = checkSorted(input_path, input_filename, 'ordersid', False)
if (not is_sorted) :
    print("==================================================================================")
    print("==================================================================================")
    print("   Your file is not sorted. Please sort your file and restart from the top.")
    print("==================================================================================")
    print("==================================================================================")

## Process Data in Batches

#### If you get an assertion error, do not proceed. Sort your input file by ordersid and restart from the top.

In [None]:
#call processBatch
i = 0
last_batch_cut_index = 0
actual_batch_size = []
while True:
    start = time.time()
    current_batch_cut_index = processBatch(i, last_batch_cut_index, batch_size, input_path, input_filename, input_patientid_var, output_filename, output_flag)
    actual_batch_size.append(current_batch_cut_index - last_batch_cut_index)
    last_batch_cut_index = current_batch_cut_index
    if last_batch_cut_index == -1:
        break
    end = time.time()
    print(f'---------------------- finished batch {i} ----------------------')
    print(f'------------------- processed to row {last_batch_cut_index} -------------------')
    print(f'------------- processing time for this batch is {round(end - start, 2)} s -------------\n\n')
    i += 1

## Combine Batch Outputs to the Final Output
#### Running this part may produce warnings about mixed column types; these can be ignored.
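
The combining pattern used in the cells of this section (remove any old combined file, write the header once, then append each batch without a header) can be sketched in isolation as follows. This is a minimal illustration on temporary files, not the notebook's own code: the helper name `combine_csv_parts`, the `chunksize` default, and the toy rows are assumptions. Reading every column as text (`dtype=str`) is one way to sidestep mixed-type warnings.

```python
import os
import tempfile
import pandas as pd

def combine_csv_parts(part_paths, combined_path, chunksize=100_000):
    """Append several CSV parts into one file, writing the header only once (hypothetical helper)."""
    if os.path.exists(combined_path):
        os.remove(combined_path)  # start from a clean file, since rows are appended below
    write_header = True
    for part in part_paths:
        # read in chunks so that a large part never has to fit in memory at once
        for chunk in pd.read_csv(part, dtype=str, chunksize=chunksize):
            chunk.to_csv(combined_path, mode='a', index=False, header=write_header)
            write_header = False

# toy batch files (made up for illustration)
with tempfile.TemporaryDirectory() as tmp:
    parts = []
    for n, rows in enumerate([[('1', 'P')], [('2', 'N'), ('3', 'N')]]):
        path = os.path.join(tmp, f'output_{n}.csv')
        pd.DataFrame(rows, columns=['ordersid', 'covid']).to_csv(path, index=False)
        parts.append(path)
    combined_path = os.path.join(tmp, 'output.csv')
    combine_csv_parts(parts, combined_path)
    combined = pd.read_csv(combined_path, dtype=str)

print(combined)
```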

In [None]:
#when this section is run separately, set TOTAL_BATCH_NUM manually to the number of batch output files
TOTAL_BATCH_NUM = i

In [None]:
#the complete output.csv
if os.path.exists(f'{input_path}{output_filename}.csv'):
    os.remove(f'{input_path}{output_filename}.csv')

output_files = [f'{output_filename}_{i}.csv' for i in range(TOTAL_BATCH_NUM)]

pd.read_csv(f'{input_path}{output_files[0]}', index_col=None, nrows=0).to_csv(input_path + f'{output_filename}.csv', mode="a", index=False)

for i in range(TOTAL_BATCH_NUM):
    chunks = pd.read_csv(f'{input_path}{output_files[i]}', index_col=None, chunksize=actual_batch_size[i], dtype={'patientid': object, 'ordersid':float})
    for chunk in chunks:
        chunk.to_csv(f'{input_path}{output_filename}.csv', mode="a", index=False, header=False)

In [None]:
#intermediate_output.csv
if os.path.exists(f'{input_path}intermediate_output.csv'):
    os.remove(f'{input_path}intermediate_output.csv')

intermediate_output_files = [f'intermediate_output_{i}.csv' for i in range(TOTAL_BATCH_NUM)]

pd.read_csv(f'{input_path}{intermediate_output_files[0]}', index_col=None, nrows=0).to_csv(input_path + f'intermediate_output.csv', mode="a", index=False)

for f in intermediate_output_files:
    chunk = pd.read_csv(f'{input_path}{f}', index_col=None)
    chunk.to_csv(f'{input_path}intermediate_output.csv', mode="a", index=False, header=False)

In [None]:
#intermediate_output_delta.csv
if os.path.exists(f'{input_path}intermediate_output_delta.csv'):
    os.remove(f'{input_path}intermediate_output_delta.csv')

intermediate_output_delta_files = [f'intermediate_output_delta_{i}.csv' for i in range(TOTAL_BATCH_NUM)]

pd.read_csv(f'{input_path}{intermediate_output_delta_files[0]}', index_col=None, nrows=0).to_csv(input_path + f'intermediate_output_delta.csv', mode="a", index=False)

for f in intermediate_output_delta_files:
    chunk = pd.read_csv(f'{input_path}{f}', index_col=None)
    chunk.to_csv(f'{input_path}intermediate_output_delta.csv', mode="a", index=False, header=False)

## Remove Batch Outputs

In [None]:
for f in output_files:
    if os.path.exists(f):
        os.remove(f)
assert(not os.path.exists(output_files[0]))

for f in intermediate_output_files:
    if os.path.exists(f):
        os.remove(f)
assert(not os.path.exists(intermediate_output_files[0]))

for f in intermediate_output_delta_files:
    if os.path.exists(f):
        os.remove(f)
assert(not os.path.exists(intermediate_output_delta_files[0]))

##IntegrationPoint04_Open
##IntegrationPoint04_Close

# Algorithm description

Using the useful_tokens field, the interpret function sequentially "reads" the terms. It picks up virus/result/test terms and holds them in a "bundle" (virus, result, test). Several modifiers also affect how the algorithm processes the terms: final (flag to take highest priority later on), presumptive (change pos to pre), end (end reading early or skip the next save), and skip (skip the 'save when virus switches' rule once). Any time a bundle is saved, the bundle (except for test type) and the final/presumptive modifiers are cleared. If a save occurs with incomplete information, the virus defaults to an unknown virus, the result defaults to negative, and the test defaults to an unknown test. Whenever a save happens, all of the previous tokens+labels that were read are considered to be a "segment".
<br>
- First, the xml field is processed if there is one. If a relevant virus is found, it is treated as a positive and the bundle is saved.
- Next, the algorithm will go through the labelled tokens one by one. There are different conditions for storing terms and saving the bundle when encountering a virus, a result, a special term, or an irrelevant (unlabelled) term.
    - Viruses: A relevant virus is always kept. If the virus switches, save the bundle (note: can be affected by skip modifier). If the same virus is read, save the bundle only if there is a result as well. An unknown virus is only kept if there is no current virus.
    - Results: A clear result (ind, neg, pos) is kept with hierarchy ind > neg > pos, so that a neg/ind can overwrite a positive when the terms are close together (e.g., "not detected" becomes a neg). An unclear result (rej, can, pen) is only kept if there is no current result, with hierarchy rej > can > pen. If there is already a previous result and a neg/ind is encountered, save the bundle.
    - Special terms:
        - Final: Modifier to add flag when saving to specify whether it is a final result, which takes higher priority over all others in the process_result function. Save if there is a current virus and current result. Clear the current result.
        - Presumptive: Modifier to change positive (r_pos) into presumptive-positive (r_pre).
        - End: Modifier to skip the next save. Save if there is a current virus and current result. Stop the reading if there are any results.
        - Skip: Modifier to skip the 'save when virus switches' once. This is reset when a virus switches and the current virus is skipped. Save if there is a current virus and current result. Clear the bundle.
        - Reset: Clear bundle without saving.
        - Stop: Save if there is a current virus and current result. Clear the bundle.        
    - Irrelevant terms: If two irrelevant terms (Nones) are read in a row, save the bundle if there is both a current result and virus. Also save the bundle if there is a virus and the past segment had another virus (virus_counter > 1; normally viruses tested are listed in a mpx or pcr assay). Otherwise, clear the bundle without saving and reset all the counter variables (i.e., start a new segment).
    - If the sentence ends before hitting two Nones, save any result and save the bundle if there is a virus and the past segment had another virus.
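
After interpretation, the roll-up code in `process_result` keeps one result per virus. The if/elif chain in that code amounts to the ordering pre > pos > ind > neg > pen > rej, with can never overwriting an existing result. The sketch below expresses that same ordering as a rank table. It is a simplified illustration: the names `RESULT_RANK` and `roll_up` are hypothetical, it ignores test types, and it does not model the final flag described above.

```python
# rank table equivalent to the if/elif chain: a higher rank wins (names are hypothetical)
RESULT_RANK = {'can': 0, 'rej': 1, 'pen': 2, 'neg': 3, 'ind': 4, 'pos': 5, 'pre': 6}

def roll_up(pairs):
    """Collapse (virus, result) pairs to the highest-ranked result per virus."""
    best = {}
    for virus, result in pairs:
        current = best.get(virus)
        # first result for this virus, or a strictly higher-ranked one, is stored
        if current is None or RESULT_RANK[result] > RESULT_RANK[current]:
            best[virus] = result
    return list(best.items())

# toy input (made up for illustration)
rolled = roll_up([('covid', 'neg'), ('flu_a', 'pen'), ('covid', 'pos'),
                  ('covid', 'ind'), ('flu_a', 'can')])
print(rolled)
```

A rank table keeps the ordering in one place, so adding a new result type means adding one dictionary entry rather than editing every branch of the chain.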