# Predicting Gout During Emergency Room Visit:  <i>Is the patient potentially suffering from Gout?</i>  

## Scope

This project covers chief-complaint corpora from the Deep South. The population from which they were derived is 54% female and 46% male; 55% Black, 40% White, 2% Hispanic, and 1% Asian. The age distribution is 5% aged 1-20 years, 35% aged 21-40 years, 35% aged 41-60 years, 20% aged 61-80 years, and 5% aged 81-100 years.

## Data

The data is extracted in csv format from the MIMIC-III (Medical Information Mart for Intensive Care III) database. Details can be found at https://physionet.org/content/emer-complaint-gout/1.0/. Access to the database may be requested at https://mimic.physionet.org/gettingstarted/access/.

The data provided by the MIMIC database consists of 2 corpora of free text collected by the triage nurse and recorded as the "Chief Complaint". Each complaint is up to 282 characters long and was collected from 2019 to 2020 at an academic medical center in the Deep South. The 2019 corpus, "GOUT-CC-2019-CORPUS", consists of 300 chief complaints selected by the presence of the keyword "gout". The 2020 corpus, "GOUT-CC-2020-CORPUS", contains 8037 chief complaints collected from a single month in 2020. The chief complaints included in both corpora were selected based on the presence of the keyword "gout".

## Cleaning and Analysis

**Import Data**

In [1]:
import pandas as pd

syn2019 = pd.read_csv('Data/GOUT-CC-2019-CORPUS-SYNTHETIC.csv')
syn2020 = pd.read_csv('Data/GOUT-CC-2020-CORPUS-SYNTHETIC.csv')


**Data Description**
* 2 csv files
    * 2019 : 300 records
    * 2020 : 8037 records
    * Identical layouts and formats: all text, 3 columns
    <br><br>
* 3 Columns:  ["Chief Complaint", "Predict", "Consensus"]
    * <b>Chief Complaint:</b> 
        * text format
        * up to 282 Chars
        * nurse recorded patient complaint
    * <b>Predict:</b> 
        * text format
        * single char ('-','U','Y','N')
        * prediction of Gout by the ER Physician
    * <b>Consensus:</b> 
        * text format
        * single char ('-','U','Y','N')
        * determination of Gout by the Rheumatologist
    <br>
* Value codes for Predict and Consensus:
    * '-' : Null
    * 'U' : Unknown
    * 'Y' : Yes
    * 'N' : No
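
The layout described above can be checked before any cleaning is done. The following is a minimal sketch, assuming a small hand-made frame in place of the real csv files; `EXPECTED_COLUMNS`, `ALLOWED_CODES`, and `check_layout` are hypothetical names introduced here for illustration and are not part of the notebook.

```python
import pandas as pd

# Hypothetical names for illustration; the toy rows stand in for the real files.
EXPECTED_COLUMNS = ["Chief Complaint", "Predict", "Consensus"]
ALLOWED_CODES = {"-", "U", "Y", "N"}

def check_layout(frame):
    """Return a list of problems found in the frame (empty list if none)."""
    problems = []
    if list(frame.columns) != EXPECTED_COLUMNS:
        problems.append(f"unexpected columns: {list(frame.columns)}")
        return problems
    # Both label columns should only contain the four documented codes.
    for col in ("Predict", "Consensus"):
        bad = set(frame[col].unique()) - ALLOWED_CODES
        if bad:
            problems.append(f"{col} has unexpected codes: {sorted(bad)}")
    # Chief complaints are described as at most 282 characters long.
    too_long = frame["Chief Complaint"].str.len() > 282
    if too_long.any():
        problems.append(f"{int(too_long.sum())} complaints exceed 282 characters")
    return problems

toy = pd.DataFrame({
    "Chief Complaint": ["R knee pain x 2 days", "gout flare up"],
    "Predict": ["N", "Y"],
    "Consensus": ["-", "Y"],
})
print(check_layout(toy))
```

With the notebook's frames, the same function could be called as `check_layout(syn2019)` and `check_layout(syn2020)`.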

## Format Data

In [2]:
print(syn2019.head())

                                     Chief Complaint Predict Consensus
0  "been feeling bad" last 2 weeks & switched BP ...       N         -
1  "can't walk", reports onset at 0830 am. orient...       Y         N
2  "dehydration" Chest hurts, hips hurt, cramps P...       Y         Y
3  "gout flare up" L arm swelling x 1 week. denie...       Y         Y
4  "heart racing,"dyspnea, and orthopnea that has...       N         -


In [3]:
print(syn2020.head())

                                     Chief Complaint Predict Consensus
0  "I dont know whats going on with my head, its ...       N         -
1  "i've been depressed for a few weeks now, i'm ...       N         -
2  Altercation while making arrest, c/o R hand pa...       N         N
3  Cut on L upper thigh wtih saw. Bleeding contro...       N         N
4   Dysuria x1 week. hx: hysterectomy, gerd, bipolar       N         -


**Combine the 2 files**

In [4]:
# Combine the files into 1 dataframe
df = pd.concat([syn2019, syn2020], axis=0).reset_index(drop=True)
print(df.shape)

(8437, 3)


**Review records for null value '-' in the files**

In [5]:
print(df['Predict'].value_counts(sort=False))

Y     111
-       2
U     156
N    8168
Name: Predict, dtype: int64


In [6]:
print(df['Consensus'].value_counts(sort=False))

Y      95
-    7976
U      16
N     350
Name: Consensus, dtype: int64


**Identify records that contain the null marker '-' in both the 'Predict' and 'Consensus' columns.**

In [7]:
print( df[(df.Consensus == '-') & (df.Predict == '-')])

                                        Chief Complaint Predict Consensus
7799  Right lower back pain that radiates down leg t...       -         -
7857  pain to posterior upper leg x 3 days, seen at ...       -         -


## Clean Data

   * Remove records that contain null values in both of the Predict and Consensus columns.
   * Fill Consensus null values ( - ) with Predict values
   * Change all chars to lowercase
   * Remove punctuation
   * Remove words containing numbers
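
The three text-cleaning steps in the list above (lowercase, remove punctuation, remove words containing numbers) could be sketched as follows. This is an illustrative sketch, not the notebook's own implementation: `clean_text` is a hypothetical helper, "punctuation" is assumed to mean the ASCII set in `string.punctuation`, and a "word" is assumed to be a whitespace-delimited token.

```python
import string

def clean_text(text):
    """Lowercase, strip ASCII punctuation, then drop tokens containing digits."""
    text = text.lower()
    # Delete every ASCII punctuation character (quotes, commas, periods, ...).
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Keep only whitespace-delimited tokens that contain no digit.
    tokens = [tok for tok in text.split() if not any(ch.isdigit() for ch in tok)]
    return " ".join(tokens)

print(clean_text('"gout flare up" L arm swelling x 1 week.'))
# gout flare up l arm swelling x week
```

Removing punctuation this way also merges contractions and abbreviations (for example, "can't" becomes "cant" and "c/o" becomes "co"), which may or may not be desirable for clinical shorthand.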

**Remove records with double 'null' values, records with '-' in both Consensus and Predict.**

In [8]:
df = df[(df.Consensus != '-') | (df.Predict != '-')]
print(df.shape)

(8435, 3)


The Predict column contains a value agreed upon by a panel of physicians, while the Consensus column holds the rheumatologist's findings. Patients who did not require follow-up with a rheumatologist (Consensus is '-') are included using their Predict values.

**Fill null values in consensus with predict value**

In [9]:
for a in df['Consensus']:
    if a == '-':
        df['Consensus'] = df['Predict']

In [10]:
print(df['Consensus'].value_counts(sort=False))

Y     111
U     156
N    8168
Name: Consensus, dtype: int64
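
Note that the loop in In [9] assigns the whole Predict column to Consensus as soon as it meets a '-', so every Consensus value is replaced. This is consistent with the counts above, which equal the Predict counts from In [5] minus the two removed rows. If the intent is to keep the rheumatologist's value wherever one exists, a masked assignment is one option. The following is a minimal sketch on toy data; the `toy` frame and the `is_null` name are illustrative and do not come from the notebook.

```python
import pandas as pd

# Toy stand-in for the combined frame (values chosen for illustration only).
toy = pd.DataFrame({
    "Predict":   ["N", "Y", "Y", "U"],
    "Consensus": ["-", "N", "Y", "-"],
})

# Boolean mask of the rows whose Consensus is the null marker.
is_null = toy["Consensus"] == "-"

# Copy Predict into Consensus only where the mask is True.
toy.loc[is_null, "Consensus"] = toy.loc[is_null, "Predict"]

print(toy["Consensus"].tolist())  # ['N', 'N', 'Y', 'U']
```

Rows 1 and 2 keep their original Consensus values, while rows 0 and 3 take the Predict value.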


In [11]:
df = df.drop(columns=['Predict'])

In [12]:
df = df.rename(columns={'Chief Complaint': 'corpus', 'Consensus': 'target'})
df

                                                 corpus target
0     "been feeling bad" last 2 weeks & switched BP ...      N
1     "can't walk", reports onset at 0830 am. orient...      Y
2     "dehydration" Chest hurts, hips hurt, cramps P...      Y
3     "gout flare up" L arm swelling x 1 week. denie...      Y
4     "heart racing,"dyspnea, and orthopnea that has...      N
...                                                 ...    ...
8432  stepped on a nail at home with right foot, pai...      N
8433  " I was having a breakdown." R/T stress and de...      N
8434  "I tried to jump in front of a car" Pt states ...      N
8435                Abdominal pain x 1 week. Denies PMH      N


In [13]:
import transformers

In [14]:
print(dir(transformers))

['ALBERT_PRETRAINED_CONFIG_ARCHIVE_MAP', 'ALBERT_PRETRAINED_MODEL_ARCHIVE_LIST', 'ALL_PRETRAINED_CONFIG_ARCHIVE_MAP', 'Adafactor', 'AdamW', 'AdamWeightDecay', 'AdaptiveEmbedding', 'AddedToken', 'AlbertConfig', 'AlbertForMaskedLM', 'AlbertForMultipleChoice', 'AlbertForPreTraining', 'AlbertForQuestionAnswering', 'AlbertForSequenceClassification', 'AlbertForTokenClassification', 'AlbertModel', 'AlbertPreTrainedModel', 'AlbertTokenizer', 'AlbertTokenizerFast', 'AutoConfig', 'AutoFeatureExtractor', 'AutoModel', 'AutoModelForCausalLM', 'AutoModelForImageClassification', 'AutoModelForMaskedLM', 'AutoModelForMultipleChoice', 'AutoModelForNextSentencePrediction', 'AutoModelForPreTraining', 'AutoModelForQuestionAnswering', 'AutoModelForSeq2SeqLM', 'AutoModelForSequenceClassification', 'AutoModelForTableQuestionAnswering', 'AutoModelForTokenClassification', 'AutoModelWithLMHead', 'AutoTokenizer', 'AutomaticSpeechRecognitionPipeline', 'BART_PRETRAINED_MODEL_ARCHIVE_LIST', 'BERT_PRETRAINED_CONFIG_A

In [22]:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")

In [23]:
classifier("I've been waiting for a course my whole life.")

[{'label': 'POSITIVE', 'score': 0.9450945854187012}]

In [24]:
classifier([
    "I've been waiting for a HuggingFace course my whole life.", 
    "I hate this so much!"
])

[{'label': 'POSITIVE', 'score': 0.9598048329353333},
 {'label': 'NEGATIVE', 'score': 0.9994558095932007}]

In [25]:
zero = pipeline("zero-shot-classification")

In [29]:
zero('I have pain on left side',candidate_labels=["gout", "arthritis", "pregnant"])

{'sequence': 'I have pain on left side',
 'labels': ['arthritis', 'gout', 'pregnant'],
 'scores': [0.6223249435424805, 0.3146713078022003, 0.06300368160009384]}

In [31]:
zero(df['corpus'][0],candidate_labels=["gout", "arthritis", "pregnant"])

{'sequence': '"been feeling bad" last 2 weeks & switched BP medications last week & worried about BP PMHx: CHF, HTN, gout, 3 strokes, DM',
 'labels': ['gout', 'arthritis', 'pregnant'],
 'scores': [0.9905892610549927, 0.007834453135728836, 0.0015763300471007824]}
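
The zero-shot pipeline returns a dict with parallel `labels` and `scores` lists, as in the two outputs above. To turn such a result into a single decision, one option is to take the highest-scoring label and require a minimum score. The sketch below works on a copy of the first result shown above; `top_label` and its `threshold` default of 0.5 are assumptions made for illustration, not values from the notebook.

```python
def top_label(result, threshold=0.5):
    """Return the highest-scoring label, or None if its score is below threshold."""
    best_label, best_score = max(
        zip(result["labels"], result["scores"]), key=lambda pair: pair[1]
    )
    return best_label if best_score >= threshold else None

# Result copied from the In [29] output above.
example = {
    "sequence": "I have pain on left side",
    "labels": ["arthritis", "gout", "pregnant"],
    "scores": [0.6223249435424805, 0.3146713078022003, 0.06300368160009384],
}
print(top_label(example))       # arthritis
print(top_label(example, 0.9))  # None
```

For the first record in `df`, the high "gout" score may simply reflect the word "gout" in the past medical history portion of the complaint; that record's target in the frame is N, so a score alone should be read with care.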

In [33]:
from transformers import pipeline

question_answerer = pipeline("question-answering")
question_answerer(
    question=df['corpus'][0],
    context="Diagnosis of gout?"
)

{'score': 0.6344756484031677,
 'start': 0,
 'end': 17,
 'answer': 'Diagnosis of gout'}
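
In the call above the complaint is passed as `question` and the short prompt as `context`, so the returned span ('Diagnosis of gout') is taken from the prompt itself. The question-answering pipeline extracts its answer from `context`, so the complaint would normally be supplied there. The sketch below only builds the keyword arguments; `build_qa_inputs` and its default question string are hypothetical and chosen for illustration.

```python
def build_qa_inputs(complaint, question="What is the diagnosis?"):
    """Pair a fixed question with the complaint text as the context to search."""
    return {"question": question, "context": complaint}

inputs = build_qa_inputs("gout flare up, L arm swelling x 1 week")
print(inputs["question"])
print(inputs["context"])
# With the notebook's pipeline this could be used as: question_answerer(**inputs)
```

Because this pipeline is extractive, it can only return a span of the complaint text; it cannot answer a yes/no question directly.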

In [34]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    df['corpus'][0],
    max_length=30,
    num_return_sequences=2,
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Input length of input_ids is 35, but ``max_length`` is set to 30.This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``.


[{'generated_text': '"been feeling bad" last 2 weeks & switched BP medications last week & worried about BP PMHx: CHF, HTN, gout, 3 strokes, DMF'},
 {'generated_text': '"been feeling bad" last 2 weeks & switched BP medications last week & worried about BP PMHx: CHF, HTN, gout, 3 strokes, DMH'}]