# **traumaScanner**

**Date:** 09-13-2023

**Author**: Meghan Hutch

**Objective**: Identify patients with post-traumatic brain hemorrhage via the use of regular expressions.

In [None]:
import os
import re
import pandas as pd
import numpy as np
import seaborn as sns
import datetime
import matplotlib.pyplot as plt

In [None]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

### Load tbiExtractor_suid dataset

We will load the dataset that was prepared in `notebooks/text-mining/merge_tbiExtractor_suid.ipynb`.

In [None]:
# dataframe with notes to evaluate
#tbiExtractor_results_all = pd.read_csv('/share/nubar/Neurotrauma/hematoma_expansion/data/processed_data/tbiExtractor_suid.csv')

radiology_reports_df = pd.read_csv('/share/nubar/Neurotrauma/hematoma_expansion/tbi_cohort/data/processed/suid_rad_reports.csv')

## Data pre-processing 

**Create a new cleaned up dataframe with only initial analysis variables of interest**

In [None]:
radiology_reports_df_analyze = radiology_reports_df[['order_reason', 'report_num_temp', 'unique_study_id', 'report']].drop_duplicates()

In [None]:
# review number of unique reports and study ids
print(len(radiology_reports_df_analyze[['report_num_temp', 'unique_study_id']].drop_duplicates()))
print(len(radiology_reports_df_analyze[['unique_study_id']].drop_duplicates()))

**Append `order_reason` text to `report`**

In [None]:
len(radiology_reports_df_analyze[radiology_reports_df_analyze['order_reason'].isnull()])

In [None]:
# add order_reason to the report; we use fillna() to handle the handful of missing order_reasons (n = 4)
#radiology_reports_df_analyze['report'] = 'Order reason: ' + radiology_reports_df_analyze['order_reason'].fillna('') + radiology_reports_df_analyze['report']

In [None]:
len(radiology_reports_df_analyze[radiology_reports_df_analyze['report'].isnull()])

In [None]:
radiology_reports_df_analyze[['order_reason', 'report_num_temp', 'unique_study_id', 'report']].head()

## Find post-traumatic hemorrhage reports

This next set of code is devised from the iterative testing performed in `notebooks/text-mining/findTBI_reports.ipynb`. Here, we will employ a series of regular expressions (regex) to identify radiology reports with post-traumatic hemorrhage.

### Identify potential trauma patients

First, we will identify patients who are being evaluated after a traumatic injury. This list was originally composed from the keywords specified in `data/trauma/search_criteria.txt`, and I added a few other words that appeared during my testing. I also implemented a for-loop for words that need to be matched exactly in order to avoid false positive matches (e.g. 'stab' returning reports with the word 'stable'; 'hit' returning reports with the word 'white'). Finally, I removed words like 'injury', which retrieved false positives for traumatic brain injury by matching phrases such as 'ischemic injury' or 'post-operative injury'.

*Note: Some of the words in the partial-matching `trauma_targets` list could be moved to the `exact_match_words` list. My original method attempted to perform only partial matching, which led to the identification of some false positive reports. Thus, only the words that absolutely needed to be exact matches were placed in the `exact_match_words` list.*
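
The difference between partial and exact matching can be sketched as follows. The report snippets are invented for illustration (they are not from the dataset), and the sketch uses plain `re` rather than the pandas `str.contains` call used in the notebook:

```python
import re

# Hypothetical report snippets, invented purely for illustration
toy_reports = [
    "Stable appearance of the ventricles.",
    "Patient presents after a stab wound to the scalp.",
    "Periventricular white matter changes.",
    "Patient was hit by a car.",
]

# Partial matching: 'stab' also matches 'stable', and 'hit' also matches 'white'
partial_pattern = re.compile(r"stab|hit", re.IGNORECASE)

# Exact matching: \b word boundaries restrict the match to whole words
exact_pattern = re.compile(r"\b(?:stab|hit)\b", re.IGNORECASE)

partial_hits = [bool(partial_pattern.search(text)) for text in toy_reports]
exact_hits = [bool(exact_pattern.search(text)) for text in toy_reports]

print(partial_hits)  # [True, True, True, True]
print(exact_hits)    # [False, True, False, True]
```

Only the two snippets that describe an actual stabbing or being hit survive the exact match.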

In [None]:
## partial text-matching
# list of trauma related key-words for partial-text matching
trauma_targets = ['trauma', 'fall', 'lacera', 'assault', 'gun', 'kick', 'gsw', 'punch',
                  'shot', 'mva', 'mvc', 'bike', 'vehicle', 'vehicular', 'collision', 'automobile']

# identify reports that have a partial-text match to at least one of the trauma_targets
potential_tbi_trauma = radiology_reports_df_analyze[radiology_reports_df_analyze['report'].str.contains('|'.join(trauma_targets), case = False, regex = True)]
print(potential_tbi_trauma[['unique_study_id']].nunique())

## exact-text matching
# initialize an empty DataFrame to store the results
exact_results_df = pd.DataFrame()

# Perform exact matches for words in exact_match_words list
# note: if we want exact word match - \b \b adds boundaries. E.g. \bstab\b will return 'stab' and not 'stable'
# added additional words 'brain injury', 'head injury', 'fell', 'stabbing' based on prior sensitivity analyses
exact_match_words = ['auto', 'accidents', 'accident', 'hit', 'stab', 'stabbed', 'stabbing', 'slip', 'slipped', 'struck', 'car',
                     'brain injury', 'head injury', 'fell', 'contrecoup', 'contracoup', 'coup', 'tSAH']
for word in exact_match_words:
    exact_match_result = radiology_reports_df_analyze[radiology_reports_df_analyze['report'].str.contains(fr'\b{word}\b', case=False, regex=True)]
    exact_results_df = pd.concat([exact_results_df, exact_match_result])

# list the dataframe of reports that matched our exact or partial-text matching criteria
report_matches = [potential_tbi_trauma, exact_results_df]

# combine dataframes and drop any duplicated reports which may appear in both
potential_tbi_trauma_reports = pd.concat(report_matches).drop_duplicates()

# print out sample size of patients/reports that were matched
print(len(potential_tbi_trauma_reports))
print(potential_tbi_trauma_reports['unique_study_id'].nunique())
print(len(potential_tbi_trauma_reports[['unique_study_id', 'report_num_temp']].drop_duplicates()))

**Create functions for text matching**

The following functions will be applied to our corpus of radiology reports. The function `report_has_required_words()` takes in the unstructured radiology report text, splits it into sentences on periods, and uses the `has_all_required_words()` function to check whether any sentence matches every regex in at least one of the pattern sets (`pattern_sets`). It returns `True` when such a sentence is found, so applying it to the `report` column yields a boolean mask that selects the entire report (or row of the dataframe).

In [None]:
# functions to apply regex; created with help from chatgpt

# Function to check if a sentence contains all required words from any pattern set
def has_all_required_words(sentence, pattern_sets):
    return any(all(re.search(pattern, sentence) for pattern in pattern_set) for pattern_set in pattern_sets)

# Function to check if a report contains at least one sentence with all required words from any pattern set
def report_has_required_words(report, pattern_sets):
    sentences = report.split('.')
    return any(has_all_required_words(sentence, pattern_sets) for sentence in sentences)
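
A minimal usage sketch of these two functions. The functions are reproduced so the block runs on its own; `toy_pattern_sets` and the two example reports are hypothetical and chosen only to show that all patterns in a set must match within the same sentence:

```python
import re

# Reproduced from the cell above so that this example is self-contained
def has_all_required_words(sentence, pattern_sets):
    return any(all(re.search(pattern, sentence) for pattern in pattern_set) for pattern_set in pattern_sets)

def report_has_required_words(report, pattern_sets):
    sentences = report.split('.')
    return any(has_all_required_words(sentence, pattern_sets) for sentence in sentences)

# Hypothetical pattern set: both patterns must match in the same sentence
toy_pattern_sets = [[r'(?i)\bno\b', r'(?i)\bhemorrhage\b']]

# 'No' and 'hemorrhage' occur in the same sentence
same_sentence = "Status post fall. No acute intracranial hemorrhage."
# 'No' and 'hemorrhage' occur in different sentences
different_sentences = "No midline shift. Small subdural hemorrhage on the left."

same_result = report_has_required_words(same_sentence, toy_pattern_sets)
different_result = report_has_required_words(different_sentences, toy_pattern_sets)

print(same_result)       # True
print(different_result)  # False
```

One caveat of splitting on `'.'` is that decimals such as "3.5 mm" are also split into separate pieces.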

### Remove likely/uncertain false positives for trauma

When devising this method, we noticed that some reports were returned if there was a phrase such as 'No history of trauma'. Previously, I thought the downstream regex for stratifying patients with and without hemorrhage would ensure these patients were not included in our post-traumatic hemorrhage cohort, but sometimes, these patients have what appears to be spontaneous / post-operative hemorrhage from non-trauma sources. 

Thus, this step attempts to ensure that we only maintain patients who we are highly confident are being evaluated for traumatic brain injury.

***Note***: 

This method will remove some patients who did have a trauma-type injury. For example, a patient may have had an MVC, but the radiologist might write: "no evidence of recent traumatic injury". This is acceptable because the overall objective is to identify patients with post-traumatic hemorrhage.
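
To make the behavior concrete, here is a small sketch that applies the first exclusion pattern used in this step to a few invented sentences (the sentences are assumptions for illustration, not text from the dataset):

```python
import re

# First exclusion pattern from the notebook: 'no'/'negative' followed later in the
# sentence by a qualifier (recent|known|...) and then by 'trauma'/'traumatic'
no_trauma_pattern = r'(?i)\b(no|negative)\b(?=.*(\b(recent|known|obvious|history|reported|definite)\b.*\b(trauma|traumatic)\b))'

# Hypothetical sentences, invented purely for illustration
toy_sentences = [
    "No known history of trauma",
    "History of trauma, no fracture",
    "MVC; no evidence of recent traumatic injury",
    "Fall from standing, evaluate for trauma",
]

flagged = [bool(re.search(no_trauma_pattern, s)) for s in toy_sentences]
print(flagged)  # [True, False, True, False]
```

The third sentence shows the trade-off described in the note: a report that mentions an MVC is still flagged for exclusion because the radiologist states that there is no evidence of recent traumatic injury.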

In [None]:
pattern_sets = [
    # the following regex employs a positive lookahead assertion; 
    # it makes sure to remove sentences where No comes before the words (recent|known|history|...), which also come before (trauma|traumatic)
    # see additional regex notes towards bottom of notebook
    [r'(?i)\b(no|negative)\b(?=.*(\b(recent|known|obvious|history|reported|definite)\b.*\b(trauma|traumatic)\b))'],
    [r'(?i)\b(vascular accident)\b'],
    [r'(?i)\b(anoxic brain injury)\b'],
]

potential_trauma_fp = potential_tbi_trauma_reports[potential_tbi_trauma_reports['report'].apply(report_has_required_words, pattern_sets=pattern_sets)]

In [None]:
len(potential_trauma_fp)

In [None]:
# can review a random sample of the reports we will exclude 
#potential_trauma_fp.sample(n = 5, random_state=1308).reset_index(drop=True)

**Remove the likely FP reports from the `potential_tbi_trauma_reports`**

In [None]:
# remove the reports that match the regex pattern above
potential_tbi_trauma_reports = potential_tbi_trauma_reports[~(potential_tbi_trauma_reports['report'].apply(report_has_required_words, pattern_sets=pattern_sets))]

# print out sample size of patients/reports that were matched
print(len(potential_tbi_trauma_reports))
print(potential_tbi_trauma_reports['unique_study_id'].nunique())
print(len(potential_tbi_trauma_reports[['unique_study_id', 'report_num_temp']].drop_duplicates()))

### Identify post-traumatic hemorrhage

The set of regular expressions curated below aims to leverage common phrases and templated language that radiologists use to describe the ***absence*** of hemorrhage or intracranial abnormalities. This list was devised during my chart reviews and sensitivity analyses, as I reviewed reports and saw the language commonly repeated to describe the absence of any hemorrhage.

**Regex notes:**

*Copied from ChatGPT; I checked the explanations against other online resources to guide my understanding.*

Let's take the following regex as an example:

`r'(?i)\b(no|negative)\b(?!.*\b(new|additional)\b).*?\b(evidence|acute|intracranial)\b.*?\b(trauma|traumatic|hemorrhage|hematoma)\b'`

* `r`: This is a Python string prefix that indicates that the following string should be treated as a raw string, which is often used with regular expressions.

* `(?i)`: This is a regex flag that makes the pattern case-insensitive. It allows the regex to match both uppercase and lowercase letters. So, "No," "no," "NO," "Negative," "negative," etc., will all be matched.

*Note: Because `(?i)` is placed at the start of the pattern, it applies to the entire expression, not just the group `(no|negative)` right after it.*

* `\b`: This is a word boundary anchor. It matches the position where a word starts or ends. It ensures that the words "no" or "negative" are matched as whole words, not as part of another word. 

* `\b(no|negative)\b`: This part is a capturing group `(...)` enclosed by word boundaries (`\b`) that matches either "no" or "negative." The word boundaries ensure that it matches these words as whole words. The `|` symbol acts as an OR operator, allowing either word to match.

* `(?!.*\b(new|additional)\b)`: This is a negative lookahead assertion `(?!...).` It checks for the absence of the following condition:

   *  `.*\b(new|additional)\b`: It matches any characters `(.*)` followed by the whole word "new" or "additional" (with word boundaries). If either word is found anywhere after "no"/"negative", the negative lookahead fails and that occurrence of "no"/"negative" cannot start a match, so such sentences are not matched by this regex.
   
*  `.*?`: This part matches any characters (as few as possible) between the preceding and following parts.

* `\b(evidence|acute|intracranial)\b`: This part matches "evidence," "acute," or "intracranial" as whole words. The `\b` word boundaries ensure they are matched entirely.

* `.*?`: Again, this part matches any characters (as few as possible) before the next specified term.

* `\b(trauma|traumatic|hemorrhage|hematoma)\b`: This part matches one of the specified medical terms as whole words, including "trauma," "traumatic," "hemorrhage," or "hematoma." The `\b` word boundaries ensure that these terms are matched entirely.

In summary, this regular expression is designed to match sentences where:

* "no" or "negative" appears as a whole word (anywhere in the sentence, since we apply the pattern with `re.search`).
* "new" or "additional" does not appear as a whole word anywhere after that "no" or "negative".
* "evidence," "acute," or "intracranial" appears after "no" or "negative" and before "trauma," "traumatic," "hemorrhage," or "hematoma."
* The words are matched as complete words due to the use of word boundaries `\b`.

This regex is intended to identify specific patterns in text data, such as medical reports, where these conditions need to be met.
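
The behavior described above can be checked on a few short sentences. This is a sketch only: the sentences are invented for illustration, while the pattern is the example regex discussed in these notes:

```python
import re

# Example regex from the notes above
example_pattern = r'(?i)\b(no|negative)\b(?!.*\b(new|additional)\b).*?\b(evidence|acute|intracranial)\b.*?\b(trauma|traumatic|hemorrhage|hematoma)\b'

# Hypothetical sentences, invented purely for illustration
toy_sentences = [
    # matches: no -> evidence -> hemorrhage
    "No evidence of acute intracranial hemorrhage",
    # no match: 'new' follows 'No', so the negative lookahead fails
    "No evidence of new hemorrhage",
    # matches: 'New' occurs before 'no', which the lookahead does not inspect
    "New left subdural collection but no acute intraparenchymal hemorrhage",
    # no match: 'Hemorrhage' occurs before 'no', not after the qualifier
    "Hemorrhage is noted, no acute infarct",
]

matches = [bool(re.search(example_pattern, s)) for s in toy_sentences]
print(matches)  # [True, False, True, False]
```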

In [None]:
test_pattern_sets = [
    #[r'(?i)\b(remote)\b(?=.*(\b(trauma)\b))'],
    [r'(?i)\b(unremarkable|negative)\b(?=.*(\b(exam|head\sCT|CT|study)\b))']
]

potential_trauma_fp = potential_tbi_trauma_reports[potential_tbi_trauma_reports['report'].apply(report_has_required_words, pattern_sets=test_pattern_sets)]
print(len(potential_trauma_fp))
potential_trauma_fp.iloc[:10]

In [None]:
# Define the regular expression pattern for each of the expressions
# this set of patterns will be applied to each sentence of each report
pattern_sets = [
    # detects sentences that contain the word no or negative before at least one of the following words (evidence|acute|intracranial), which also occur before the words (trauma|traumatic|hemorrhage|hematoma)
    # importantly, if the word 'new' or 'additional' follows, the sentence is not matched, so the report is not excluded. Often these words are part of phrases that suggest there is hemorrhage
    [r'(?i)\b(no|negative)\b(?!.*\b(new|additional)\b).*?\b(evidence|acute|intracranial)\b.*?\b(trauma|traumatic|hemorrhage|hematoma)\b'],
    # detect matches without the previous qualifier (evidence|acute|intracranial) from above
    [r'(?i)\b(no|negative)\b(?:(?!(\bnew\b|\badditional\b)).)*\b(trauma|traumatic|hemorrhage|hematoma)\b'],
    # detects no acute findings
    # note: adding 'intracranial' as a prefix before 'findings' leads to two false positives; thus will not combine the two subsequent regex
    [r'(?i)\b(no|negative)\b(?=.*(\b(acute)\b.*\b(findings)\b))'],
    # detects no CT abnormalities
    [r'(?i)\b(no|negative)\b(?=.*(\b(intracranial|acute)\b.*\b(abnormality|abnormalities)\b))'],
    # detects no abnormality
    [r'(?i)\b(no|negative)\b(?=.*(\b(abnormality|abnormalities)\b))'],
    # detects negative|unremarkable head CT / negative finding
    [r'(?i)\b(unremarkable|negative)\b(?=.*(\b(exam|head\sCT|CT|study)\b))'],
    # detects normal exam
    [r'(?i)\b(normal study|normal CT|normal head CT|normal exam|normal noncontrast|normal plain)\b'],
    # detects no intracranial process - including acute might return us old scans that we may want to keep
    [r'(?i)\b(no|negative|without\sevidence)\b(?=.*(\b(intracranial)\b.*\b(process|pathology)\b))'],
    # remove exact phrases of the following:
    [r'(?i)\b(without acute intracranial abnormality)\b'],
    [r'(?i)\b(without evidence for acute abnormality)\b'],
    [r'(?i)\b(without acute abnormality)\b']

]

#### testing code

In [None]:
### code provided with help of chatgpt
### this will print out the first sentence that matches and return the pattern that matched it
import re

# Function to check if a sentence contains all required words from any pattern set
def has_all_required_words_debug(sentence, pattern_sets):
    #print(sentence)
    for pattern_set in pattern_sets:
        for pattern in pattern_set:
            if re.search(pattern, sentence):
                return pattern  # Return the matching pattern
    return None  # Return None if no pattern matched

# Function to check if a report contains at least one sentence with all required words from any pattern set
def report_has_required_words_debug(report, pattern_sets):
    sentences = report.split('.')
    for sentence in sentences:
        pattern = has_all_required_words_debug(sentence, pattern_sets)
        if pattern:
            print(sentence)
            return pattern  # Return the matching pattern
    return None  # Return None if no pattern matched

In [None]:
# create fake reports to test different regex
d = {'report': ['there are no new acute of hemorrhage or new extra-axial fluid identified.']}
text_ex = pd.DataFrame(data = d)
#print(text_ex)

text_ex['report'].apply(report_has_required_words_debug, pattern_sets=pattern_sets)

#### Apply regex statements to identify patients without post-traumatic hemorrhage

In [None]:
# this will identify all reports with no hemorrhage
# for this process, we will create an abbreviated version of the larger dataset `potential_tbi_trauma_reports` and apply our `pattern_sets`
potential_tbi_no_hem = potential_tbi_trauma_reports[['unique_study_id', 'report_num_temp', 'report']].drop_duplicates()
# filter the abbreviated dataframe itself so that the boolean mask and the dataframe share the same index
potential_tbi_no_hem = potential_tbi_no_hem[potential_tbi_no_hem['report'].apply(report_has_required_words, pattern_sets=pattern_sets)]

In [None]:
potential_tbi_no_hem.head()

In [None]:
print(len(potential_tbi_no_hem[['unique_study_id']].drop_duplicates()))
print(len(potential_tbi_no_hem[['unique_study_id', 'report_num_temp']].drop_duplicates()))

#### Return patients with likely hemorrhage

Next, we will left-merge the original `potential_tbi_trauma_reports` dataset with the reports flagged as having no hemorrhage and keep only the rows that appear in the former (an anti-join), in order to return the patients with likely post-traumatic hemorrhage.
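
The merge-based filtering can be sketched on toy data. The dataframes `all_reports` and `no_hem` below are hypothetical stand-ins for `potential_tbi_trauma_reports` and `potential_tbi_no_hem`, with invented IDs:

```python
import pandas as pd

# Hypothetical stand-in for all potential trauma reports
all_reports = pd.DataFrame({
    'unique_study_id': [1, 1, 2, 3],
    'report_num_temp': [1, 2, 1, 1],
})

# Hypothetical stand-in for the reports flagged as having no hemorrhage
no_hem = pd.DataFrame({
    'unique_study_id': [1, 3],
    'report_num_temp': [2, 1],
})

# Left merge with indicator=True adds a `_merge` column; rows marked
# 'left_only' exist in all_reports but not in no_hem
likely_hem = (
    pd.merge(all_reports, no_hem, how='left', indicator=True)
    .query('_merge == "left_only"')
    .drop('_merge', axis=1)
)

print(likely_hem)
```

In this toy example patient 1 has one report in each group, which is why patient counts from the two groups need not add up to the original total while report counts do.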

In [None]:
potential_tbi_hem = pd.merge(potential_tbi_trauma_reports[['unique_study_id', 'report_num_temp', 'report']].drop_duplicates(), 
                             potential_tbi_no_hem[['unique_study_id', 'report_num_temp', 'report']].drop_duplicates(), 
                             indicator = True, how = 'left').query('_merge=="left_only"').drop('_merge', axis=1)


# before dropping any rows
# note: adding up the patient IDs of potential_tbi_hem and potential_tbi_no_hem is not a good check that our merging worked because patients could end up in both groups depending on how/when their reports were taken
# note: however, the number of reports from both new dataframes should match original potential_tbi_trauma_reports
print(len(potential_tbi_hem[['unique_study_id']].drop_duplicates()))
print(len(potential_tbi_hem[['unique_study_id', 'report_num_temp']].drop_duplicates()))

print(len(potential_tbi_no_hem[['unique_study_id']].drop_duplicates()))
print(len(potential_tbi_no_hem[['unique_study_id', 'report_num_temp']].drop_duplicates()))

print(len(potential_tbi_trauma_reports[['unique_study_id']].drop_duplicates()))
print(len(potential_tbi_trauma_reports[['unique_study_id', 'report_num_temp']].drop_duplicates()))

**Rescue reports**

Next, we will rescue reports that may have been excluded due to the regex rules that excluded reports containing sentences such as:

- "No evidence of additional hemorrhage"
- "No interval change in the amount of hemorrhage"
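
As a quick sanity check, the rescue pattern used in this step can be applied to the two example sentences above plus two additional sentences that are invented for illustration:

```python
import re

# Rescue pattern from the notebook
rescue_pattern = r'(?i)\b(no|negative)\b(?=.*(\b(additional|(chang|increas|decreas)(e|ed|ing))\b.*\b(hematoma|hemorrhage|hemorrhagic|contusion)\b))'

toy_sentences = [
    "No evidence of additional hemorrhage",
    "No interval change in the amount of hemorrhage",
    # hypothetical: a plain negative statement, should not be rescued
    "No acute intracranial hemorrhage",
    # hypothetical: the plural 'changes' is not covered by (e|ed|ing)
    "No significant interval changes in the subdural hematoma",
]

rescued = [bool(re.search(rescue_pattern, s)) for s in toy_sentences]
print(rescued)  # [True, True, False, False]
```

The last sentence suggests that plural forms such as "changes" are not captured by the current alternation, which may be worth reviewing during chart validation.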

In [None]:
# evaluate whether we are excluding patients with "no additional hemorrhage"
# try ensuring "No" prior to (additional|change|decrease) in (hematoma|hemorrhage|contusion)
# our regex also enables capturing `change`, `changed`, `changing` (and similar matching with increase and decrease)
# we will also include the word hemorrhage; in this instance, we want to try and identify any reports that may have been false negative for post-traumatic hemorrhage
# contusion seems pretty sensitive for brain trauma/potential hemorrhage 
pattern_sets = [
    [r'(?i)\b(no|negative)\b(?=.*(\b(additional|(chang|increas|decreas)(e|ed|ing))\b.*\b(hematoma|hemorrhage|hemorrhagic|contusion)\b))']
]

rescue_additional_reports = potential_tbi_no_hem[potential_tbi_no_hem['report'].apply(report_has_required_words, pattern_sets=pattern_sets)]
print(len(rescue_additional_reports))

In [None]:
rescue_additional_reports.sample(n = 5, random_state=1308).reset_index(drop=True)

In [None]:
# resave the potential_tbi_no_hem to remove the rescued reports
potential_tbi_no_hem_v2 = potential_tbi_no_hem[~potential_tbi_no_hem['report'].apply(report_has_required_words, pattern_sets=pattern_sets)]

In [None]:
# testing whether secondary hemorrhage can be a key phrase to identify post-traumatic hemorrhage reports
test_pattern_sets = [
   # [r'(?i)\b(secondary)\b(?=.*(\b(hematoma|hemorrhage|contusion)\b))'],
    #[r'(?i)\b(subarachnoid|subdural|intraparenchymal|intraventricular)\b(?=.*(\b(hematoma|hemorrhage|contusion)\b))'],
    [r'(?i)\b(contracoup)\b'] 
]

potential_trauma_fn = potential_tbi_no_hem_v2[potential_tbi_no_hem_v2['report'].apply(report_has_required_words, pattern_sets=test_pattern_sets)]
print(len(potential_trauma_fn))
potential_trauma_fn.iloc[:10]

In [None]:
# add reports to the `potential_tbi_hem` dataset
tbi_reports_list = [potential_tbi_hem, rescue_additional_reports]

post_traumatic_hem_reports = pd.concat(tbi_reports_list)
len(post_traumatic_hem_reports)

In [None]:
# evaluate whether we've correctly concatenated the extra reports
len(potential_tbi_hem) + len(rescue_additional_reports) == len(post_traumatic_hem_reports)

#### Tally total number of reports and patients

In [None]:
print(len(post_traumatic_hem_reports[['unique_study_id']].drop_duplicates()))
print(len(post_traumatic_hem_reports[['unique_study_id', 'report_num_temp']].drop_duplicates()))

## Chart Review Validation

Next, we will randomly sample reports in order to perform a manual chart review and validation of this method

In [None]:
# need to stratify reports by post_traumatic_hemorrhage, potential_tbi_no_hem
post_traumatic_hem_reports_to_validate = post_traumatic_hem_reports.sample(n = 50, random_state=1308).reset_index(drop=True)

In [None]:
#post_traumatic_hem_reports_to_validate

In [None]:
# automatically save a file with the current date and time
#date_to_save = f'/share/nubar/Neurotrauma/hematoma_expansion/data/chart_review/{datetime.datetime.now().strftime("%Y%m%d_%H%M")}_potential_tbi_to_review.csv'
#post_traumatic_hem_reports_to_validate.to_csv(date_to_save, index = False)

In [None]:
non_post_traumatic_hem_reports_to_validate = potential_tbi_no_hem_v2.sample(n = 50, random_state=1308).reset_index(drop=True)

In [None]:
#non_post_traumatic_hem_reports_to_validate.to_csv('/share/nubar/Neurotrauma/hematoma_expansion/data/chart_review/2023.10.10_potential_tbi_non_hem_chart_review.csv')

#### Save list of the identified patients and reports

In [None]:
post_traumatic_hem_reports_unique = post_traumatic_hem_reports[['unique_study_id', 'report_num_temp']].drop_duplicates()

In [None]:
post_traumatic_hem_reports.nunique()

In [None]:
# automatically save a file with the current date and time
#date_to_save = f'/share/nubar/Neurotrauma/hematoma_expansion/data/processed_data/{datetime.datetime.now().strftime("%Y%m%d_%H%M")}_potential_trauma_patient_reports.csv'

date_to_save = f'/share/nubar/Neurotrauma/hematoma_expansion/tbi_cohort/data/processed/{datetime.datetime.now().strftime("%Y%m%d_%H%M")}_potential_trauma_patient_reports.csv'
post_traumatic_hem_reports_unique.to_csv(date_to_save, index = False)

**Evaluate differences from previous iteration**

The below code facilitates checking the addition or subtraction of reports based on different changes to regex / filtering rules
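
The comparison relies on an outer merge with `indicator=True`. A minimal sketch on toy data, where `previous` and `current` are hypothetical stand-ins for the saved and newly generated report lists:

```python
import pandas as pd

# Hypothetical stand-ins for the previous and current iterations
previous = pd.DataFrame({'unique_study_id': [1, 2, 3], 'report_num_temp': [1, 1, 1]})
current = pd.DataFrame({'unique_study_id': [2, 3, 4], 'report_num_temp': [1, 1, 1]})

# The outer merge keeps every row from both sides; `_merge` records its origin
diff = pd.merge(previous, current, how='outer', indicator=True)

# left_only: present previously, no longer included in the current iteration
removed_ids = diff.loc[diff['_merge'] == 'left_only', 'unique_study_id'].tolist()
# right_only: newly identified in the current iteration
added_ids = diff.loc[diff['_merge'] == 'right_only', 'unique_study_id'].tolist()

print(removed_ids, added_ids)  # [1] [4]
```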

In [None]:
#tbiExtractor_results_all = pd.read_csv('/share/nubar/Neurotrauma/hematoma_expansion/data/processed_data/tbiExtractor_suid.csv')

In [None]:
# select a previous list of reports 
#post_traumatic_hem_reports_unique_orig = pd.read_csv('/share/nubar/Neurotrauma/hematoma_expansion/data/processed_data/20231017_1635_potential_trauma_patient_reports.csv')
post_traumatic_hem_reports_unique_orig = pd.read_csv('/share/nubar/Neurotrauma/hematoma_expansion/data/processed_data/20231024_1615_potential_trauma_patient_reports.csv')

In [None]:
### check differences between the current and past iteration
diff_reports = pd.merge(post_traumatic_hem_reports_unique_orig, 
                        post_traumatic_hem_reports_unique,
                        how = 'outer',
                        indicator = True)

# right_only lists the reports that were most recently identified in the current, but not the previous iteration
# left_only lists the reports that are no longer included in the current iteration
print(len(diff_reports[diff_reports['_merge'] == 'left_only']))

In [None]:
# select either `left_only` or `right_only` to review
reports_removed = diff_reports[diff_reports['_merge'] == 'left_only']

# merge discordant reports with the radiology reports dataframe in order to review the radiology report
diff_reports_to_eval = pd.merge(radiology_reports_df_analyze[['unique_study_id', 'order_reason', 'report_num_temp', 'report']].drop_duplicates(),
         reports_removed,
         how = 'inner')

In [None]:
diff_reports_to_eval.iloc[:11]

**Which patients were newly identified as trauma patients?**

In [None]:
### check differences between the current and past iteration
diff_id_reports = pd.merge(post_traumatic_hem_reports_unique_orig[['unique_study_id', 'report_num_temp']].drop_duplicates(), 
                        post_traumatic_hem_reports_unique[['unique_study_id', 'report_num_temp']].drop_duplicates(),
                        how = 'outer',
                        indicator = True)

# right_only lists the reports that were most recently identified in the current, but not the previous iteration
# left_only lists the reports that are no longer included in the current iteration
print(len(diff_id_reports[diff_id_reports['_merge'] == 'right_only']))

In [None]:
# select either `left_only` or `right_only` to review; here `right_only` returns the newly identified reports
reports_id_removed = diff_id_reports[diff_id_reports['_merge'] == 'right_only']

# merge discordant reports with the radiology reports dataframe in order to review the radiology report
diff_id_reports_to_eval = pd.merge(radiology_reports_df_analyze[['unique_study_id', 'order_reason', 'report_num_temp', 'report']].drop_duplicates(),
         reports_id_removed,
         how = 'inner')