# Chief complaint processing script

- Mike Conway, PhD (Joining the University of Melbourne)
- Brian Chapman

## Background

The purpose of this lab is to expose you to a _practical problem_ in public health informatics: automatically identifying the following syndromes from __chief complaints__:

- Influenza-like illness (ILI) syndrome
- Gastrointestinal (GI) syndrome
- Respiratory syndrome
- Constitutional syndrome



The script first opens the file "cc_only.txt", then "cleans" each chief complaint string by removing punctuation, etc. Then, a set of patterns for each of the 4 syndromes (constitutional, GI, ILI, and respiratory) is matched against the chief complaint. The output of the script is the total number of matches for each syndrome.

The first part of the notebook consists of 6 processing steps.  The second part of the notebook consists of further, optional work.

### Background Material:  Syndromic Surveillance and Chief Complaints

- __Conventional public health disease surveillance__ relies on the routine manual or electronic filing (by clinicians and laboratories) of reportable and unusual diseases that alert public health officials to disease outbreak clusters of interest. 
    - Depends on confirmatory laboratory testing __after preliminary diagnosis by a clinician__. 
    - In many cases takes days of testing and epidemiological analysis before an outbreak is identified.
    - May rely on passive and voluntary reporting of cases 
    - May not be timely enough to provide the information needed to detect and monitor a rapidly evolving outbreak

- __Syndromic surveillance__ focuses on the early symptom (prodromal) period before clinical or laboratory confirmation.
    - May utilise both clinical and alternative data sources that reflect measurable alterations in personal behaviours that may precede a clinical diagnosis.
    - Often utilises data sources that already exist but have not been designed specifically for public health surveillance purposes, for example:
        - prescriptions filled
        - retail drug and product sales
        - school or work absenteeism
        
#### In this lab we are focussing on chief complaints as a data source.

### What is a chief complaint?

A chief complaint is a short phrase entered by a triage nurse or admission clerk describing the reason for a patient's visit to a medical facility. 

### Why use chief complaints?

Chief complaints are nearly ubiquitously available in the United States, routinely generated during normal hospital operations, and available electronically during or shortly after a patient's visit, thus providing a basis for real-time surveillance (for a random sample of chief complaints, see table below). 

Various clinical, research, and administrative objectives all rely on the presence of an easily identifiable and unambiguous chief complaint. However, to be useful for syndromic surveillance, the free-text triage chief complaints must first be classified into syndromic categories or into some other type of coded representation that can be manipulated by a computer. 

Hand-coding data into syndrome categories, whether performed onsite in the medical facility or offsite, requires considerable time and labor. To make chief complaint data more realistically usable for ongoing surveillance, automated syndromic categorisation applications have been developed. 

However, automated chief complaint categorisation still faces two obstacles that must be overcome to classify chief complaints efficiently and effectively: the challenging nature of the data (prevalence of abbreviations and misspellings, context-sensitive vocabulary, inter-hospital variation) and usability considerations (for example, providing a means for refining syndrome criteria). Furthermore, chief complaints vary in accuracy because they are recorded prior to clinician involvement in care and can therefore lack the diagnostic precision of clinician-generated reports.

| |   |     |
|------|------|-----|
|   injury, toe | migraine|fell off bus  |
|confused|weakness|psychiatric evaluation|
|detox from heroin| vomiting up blood| right knee pain|
|crying/vomiting| rash on face| injured finger|
|right shoulder injury| slurred speech | head injury |
|stomach cramps | cold | tired/dizzy |
| medical| diff swallowing | followup|
|l hip pain| dental filling| labial swelling |
|body ache|optical exam|throat swelling|
|visual disturbance| earache | nausea|
|sprained ankle| grion pain| eye injuery|
|trouble urinating| palpitations | diabetic|
|injured leg| sores on back | foreign body, throat|



The use of free text in chief-complaint-based syndromic surveillance systems requires managing the substantial variation that results from the use of synonyms, abbreviations, acronyms, truncations, misspellings and typographic errors (note "grion pain" and "eye injuery" in the table above).

Failure to detect these linguistic variations could result in missed cases, and traditional methods for capturing this variation require ongoing, labor-intensive maintenance.

**In this lab we will use simple string matching techniques to identify chief complaints associated with several syndromes of interest (i.e. Influenza-like-illness syndrome, gastro-intestinal syndrome, respiratory syndrome, and constitutional syndrome).**


## Syndromes

We will be looking at 4 syndromes in this lab:

* Influenza-Like Illness Syndrome – characteristic symptoms include fever, chills, and malaise
* Constitutional Syndrome – characteristic symptoms include fever, lethargy, and myalgia
* Respiratory Syndrome – e.g. cough, gasping, and shortness of breath
* Gastrointestinal Syndrome – e.g. abdominal pain, vomiting, and nausea

Note that symptoms can belong to multiple syndromes (fever, for example, appears under both influenza-like illness and constitutional syndrome). This is partly what makes syndromic surveillance difficult.
Our goal is to automatically classify relevant chief complaints into appropriate syndromic categories.



In [None]:
from pypop.utils import *
from pypop.view import *

In [None]:
cfs = get_chief_complaint_data()

### How many cases do we have?

In [None]:
cfs.shape

In [None]:
show_data(cfs)

You can see that the text is quite messy, with lots of extraneous characters (e.g. tab characters).

## 2. Strip punctuation from chief complaints and lowercase text

Python comes with a string (`string.punctuation`) containing the common punctuation characters. In the cell below I convert this to a list. You can add elements to the list with the `append` method if you wish. For example:

```Python
punctuation.append("😂")
```

In [None]:
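import string  # standard library; harmless if pypop.utils already exposes it

punctuation = list(string.punctuation)
punctuation.append("😂")

# Show how many characters we have, then print them five per row
print(len(punctuation))
for i, p in enumerate(punctuation, start=1):
    print(p, end=" ")
    if i % 5 == 0:
        print()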

In [None]:
cfs = get_cleaned_cfs(cfs, punctuation)
show_data(cfs)

At the end of this process, `cfs` contains the individual chief complaints with punctuation and trailing white space removed, and with all text converted to lowercase.
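
The `get_cleaned_cfs` helper comes from the lab's `pypop` package and its implementation is not shown here. As a rough sketch of the kind of cleaning described above, assuming each chief complaint is a plain Python string, a hypothetical `clean_complaint` function (my name, not part of `pypop`) might look like this. Note two assumptions: punctuation is replaced by a space rather than deleted, so that "crying/vomiting" stays two words, and runs of white space are collapsed, which goes slightly beyond trimming trailing white space.

```python
import string


def clean_complaint(text, punctuation=None):
    """Hypothetical helper: drop punctuation, lowercase, tidy white space."""
    if punctuation is None:
        punctuation = list(string.punctuation)
    # Replace each punctuation character with a space (an assumption;
    # the real helper may simply delete the character instead)
    for p in punctuation:
        text = text.replace(p, " ")
    # split() with no arguments splits on any white space, including tabs,
    # and discards empty pieces, so joining removes leading/trailing space
    return " ".join(text.lower().split())


print(clean_complaint("Crying/Vomiting\t"))  # prints: crying vomiting
```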

## 3.  Create patterns for matching syndromes

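In this stage we identify patterns associated with each syndrome (e.g. constitutional syndrome is associated with dizziness ("dizz") and faintness ("faint")). The patterns are deliberately truncated so that one pattern covers several word forms: "dizz" matches both "dizzy" and "dizziness".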

In the cell below, add terms for __food poisoning (FP)__, __asthma (ASTHMA)__, and __neurological syndrome (NEURO)__. Also feel free to edit the lists for the other syndromes.

#### Please note the syntax:

- A list is enclosed by square brackets
- Elements within lists are separated by commas
- Strings are defined with quotation marks

You may find the following DataFrame, which contains words and their frequencies within the chief complaints, helpful:

In [None]:
show_data(get_word_counts(cfs))
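
How `get_word_counts` builds its table is not shown in this notebook. If you want to explore word frequencies on your own, one possible approach, assuming the cleaned complaints are available as a list of strings, uses `collections.Counter` from the standard library. The function name `count_words` and the sample list are illustrative only.

```python
from collections import Counter


def count_words(complaints):
    """Hypothetical helper: count white-space-separated words across complaints."""
    counts = Counter()
    for cc in complaints:
        counts.update(cc.split())
    return counts


sample = ["right knee pain", "l hip pain", "stomach cramps"]
word_counts = count_words(sample)

# The most frequent words are good candidates for syndrome patterns
for word, n in word_counts.most_common(3):
    print(word, n)
```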

In [None]:
syndromes = {}
syndromes["CON"] =   ["dizz", "faint", "malaise", "irritable", "letharg"]
syndromes["GI"]  =   ["vomit", "nausea", "loose stool", "spitting up", "watery stool"]
syndromes["RESP"] =  ["asthma", "cough", "gasp", "breath", "wheez"]
syndromes["ILI"]  =   ["chill", "fatigue", "arthralgia", "myalgia", "malaise"]
syndromes["FP"] = []
syndromes["ASTHMA"] = []
syndromes["NEURO"] = []

In [None]:
cfs2 = map_syndromes(cfs, syndromes)
show_data(cfs2)

In [None]:
cfs3 = get_syndrome_counts(cfs2, syndromes)
show_data(cfs3)
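
The matching itself is done inside `map_syndromes` and `get_syndrome_counts`, whose code is not shown here. As a minimal sketch of simple string matching, assuming a pattern "matches" when it occurs anywhere in the cleaned complaint, the idea can be written in plain Python as follows. The names `match_syndromes`, `demo_syndromes`, and `demo_complaints` are made up for this illustration.

```python
def match_syndromes(complaint, syndromes):
    """Return the syndromes with at least one pattern found as a substring."""
    return [name for name, patterns in syndromes.items()
            if any(p in complaint for p in patterns)]


demo_syndromes = {
    "CON": ["dizz", "faint", "malaise"],
    "ILI": ["chill", "fatigue", "malaise"],
    "RESP": ["cough", "breath", "wheez"],
}
demo_complaints = ["tired dizzy", "malaise and chills",
                   "sprained ankle", "short of breath"]

# Tally how many complaints match each syndrome
counts = {name: 0 for name in demo_syndromes}
for cc in demo_complaints:
    for name in match_syndromes(cc, demo_syndromes):
        counts[name] += 1

print(counts)  # {'CON': 2, 'ILI': 1, 'RESP': 1}
```

Two things are worth noticing. "malaise and chills" is counted under both CON and ILI, because a complaint can belong to more than one syndrome. And a syndrome with an empty pattern list, such as `syndromes["FP"] = []` before you add terms, can never match anything.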

In [None]:
cfs3.plot.bar()

## Find Longest Chief Complaint ##

Chief complaints vary considerably in length depending on the person who writes them, the particular institution, and the EHR used. Some are highly abbreviated, and some are full sentences. 
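
The cells below rely on the `get_sorted_df` helper from `pypop`. To illustrate the underlying idea only, here is a small sketch, assuming length is measured in characters, that sorts a few made-up complaints from longest to shortest using the built-in `sorted` function.

```python
sample = ["cold", "psychiatric evaluation", "l hip pain"]

# Sort from longest to shortest, measuring length in characters
by_length = sorted(sample, key=len, reverse=True)
lengths = [len(cc) for cc in by_length]

print(by_length[0])  # prints: psychiatric evaluation
print(lengths)
```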


In [None]:
cfs4 = get_sorted_df()

In [None]:
show_data(cfs4)

#### Histogram of chief complaint lengths

In [None]:
cfs4.plot.hist(bins=50)