# Predicting Gout During Emergency Room Visit:  <i>Is the patient potentially suffering from Gout?</i>  

## Scope

This project covers corpora from the Deep South. The population from which they were derived was 54% female and 46% male; 55% Black, 40% White, 2% Hispanic, and 1% Asian. By age, 5% were 1-20 years old, 35% were 21-40, 35% were 41-60, 20% were 61-80, and 5% were 81-100.

## Data

The data is extracted in csv format from the MIMIC-III (Medical Information Mart for Intensive Care III) database. Details can be found at https://physionet.org/content/emer-complaint-gout/1.0/. Access to the database may be requested at https://mimic.physionet.org/gettingstarted/access/.

The data provided by the MIMIC database consists of 2 corpora of free text collected by the triage nurse and recorded as the "Chief Complaint". Each complaint is up to 282 characters long and was collected from 2019 to 2020 at an academic medical center in the Deep South. The 2019 corpus, "GOUT-CC-2019-CORPUS", consists of 300 chief complaints selected by the presence of the keyword "gout". The 2020 corpus, "GOUT-CC-2020-CORPUS", contains 8037 chief complaints collected from a single month in 2020.

## Cleaning and Analysis

**Import Data**

In [15]:
import pandas as pd

syn2019 = pd.read_csv('Data/GOUT-CC-2019-CORPUS-SYNTHETIC.csv')
syn2020 = pd.read_csv('Data/GOUT-CC-2020-CORPUS-SYNTHETIC.csv')


**Data Description**
* 2 csv files
    * 2019 : 300 records
    * 2020 : 8037 records
    * Identical layouts and formats: all text, 3 columns
    <br><br>
* 3 Columns:  ["Chief Complaint", "Predict", "Consensus"]
    * <b>Chief Complaint:</b> 
        * text format
        * up to 282 Chars
        * nurse recorded patient complaint
    * <b>Predict:</b> 
        * text format
        * single char ('-','U','Y','N')
        * prediction of Gout by the ER Physician
    * <b>Consensus:</b> 
        * text format
        * single char ('-','U','Y','N')
        * determination of Gout by the Rheumatologist
    <br>
* 
          - : Null
          U : Unknown
          Y : Yes
          N : No

## Format Data

In [16]:
print(syn2019.head())

                                     Chief Complaint Predict Consensus
0  "been feeling bad" last 2 weeks & switched BP ...       N         -
1  "can't walk", reports onset at 0830 am. orient...       Y         N
2  "dehydration" Chest hurts, hips hurt, cramps P...       Y         Y
3  "gout flare up" L arm swelling x 1 week. denie...       Y         Y
4  "heart racing,"dyspnea, and orthopnea that has...       N         -


In [17]:
print(syn2020.head())

                                     Chief Complaint Predict Consensus
0  "I dont know whats going on with my head, its ...       N         -
1  "i've been depressed for a few weeks now, i'm ...       N         -
2  Altercation while making arrest, c/o R hand pa...       N         N
3  Cut on L upper thigh wtih saw. Bleeding contro...       N         N
4   Dysuria x1 week. hx: hysterectomy, gerd, bipolar       N         -


**Combine the 2 files**

In [18]:
# Combine the files into 1 dataframe
df = pd.concat([syn2019, syn2020], axis=0).reset_index(drop=True)
print(df.shape)

(8437, 3)


**Review records for null value '-' in the files**

In [19]:
print(df['Predict'].value_counts(sort=False))

U     156
Y     111
N    8168
-       2
Name: Predict, dtype: int64


In [20]:
print(df['Consensus'].value_counts(sort=False))

U      16
Y      95
N     350
-    7976
Name: Consensus, dtype: int64


**Find records that contain nulls ('-') in both the 'Predict' and 'Consensus' columns.**

In [21]:
print( df[(df.Consensus == '-') & (df.Predict == '-')])

                                        Chief Complaint Predict Consensus
7799  Right lower back pain that radiates down leg t...       -         -
7857  pain to posterior upper leg x 3 days, seen at ...       -         -


## Clean Data

   * Remove records that contain null values in both of the Predict and Consensus columns.
   * Fill Consensus null values ( - ) with Predict values
   * Change all chars to lowercase
   * Remove punctuation
   * Remove words containing numbers
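
The last three steps in the list above (lowercase, remove punctuation, remove words containing numbers) are not shown in the cells that follow. A minimal sketch of one way to do them with the standard library; the function name `clean_complaint` is hypothetical and not part of the original notebook, and replacing punctuation with a space (rather than deleting it) is an assumption made so that terms such as "swelling/stiff" stay separate words.

```python
import re
import string

def clean_complaint(text):
    """Lowercase, strip punctuation, and drop tokens that contain digits."""
    text = text.lower()
    # Replace every ASCII punctuation character with a space so that
    # "swelling/stiff" becomes two tokens instead of one fused word.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table)
    # Keep only tokens without any digit (drops "0830", "x2d", "5th").
    tokens = [tok for tok in text.split() if not re.search(r"\d", tok)]
    return " ".join(tokens)

print(clean_complaint('"gout flare up" L arm swelling x 1 week.'))
```

Applied to the combined frame, this could be used as `df['Chief Complaint'].apply(clean_complaint)`.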

**Remove records with double 'null' values, records with '-' in both Consensus and Predict.**

In [22]:
df = df[(df.Consensus != '-') | (df.Predict != '-')]
print(df.shape)

(8435, 3)


The Predict column contains a value agreed upon by a panel of physicians, while the Consensus column holds the Rheumatologist's findings. Patients who did not require follow-up with a Rheumatologist have no Consensus value ('-') and are included using their Predict value.

**Fill null values in consensus with predict value**

In [23]:
for a in df['Consensus']:
    if a == '-':
        df['Consensus'] = df['Predict']

In [24]:
print(df['Consensus'].value_counts(sort=False))

U     156
Y     111
N    8168
Name: Consensus, dtype: int64
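
The loop above assigns the entire Predict column to Consensus as soon as it meets one '-', which is why the counts printed here are identical to the Predict counts rather than a blend of the two columns. If the intent is to fill only the null entries, a boolean mask does that. A minimal sketch on made-up rows (the toy frame is an assumption for illustration, not data from the corpus):

```python
import pandas as pd

toy = pd.DataFrame({
    "Predict":   ["N", "Y", "Y", "U"],
    "Consensus": ["-", "N", "Y", "-"],
})

# Copy Predict into Consensus only where Consensus is the null marker '-'.
mask = toy["Consensus"] == "-"
toy.loc[mask, "Consensus"] = toy.loc[mask, "Predict"]

print(toy["Consensus"].tolist())
```

Rows that already carry a Consensus value (the second and third here) keep it, so a disagreement between the two columns is resolved in favor of Consensus.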


In [25]:
df = df.drop(columns=['Predict'])

In [26]:
df = df.rename(columns={'Chief Complaint': 'corpus', 'Consensus': 'target'})
df

                                                 corpus  target
0     "been feeling bad" last 2 weeks & switched BP ...       N
1     "can't walk", reports onset at 0830 am. orient...       Y
2     "dehydration" Chest hurts, hips hurt, cramps P...       Y
3     "gout flare up" L arm swelling x 1 week. denie...       Y
4     "heart racing,"dyspnea, and orthopnea that has...       N
...                                                 ...     ...
8432  stepped on a nail at home with right foot, pai...       N
8433  " I was having a breakdown." R/T stress and de...       N
8434  "I tried to jump in front of a car" Pt states ...       N
8435                Abdominal pain x 1 week. Denies PMH       N


In [27]:
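import warnings
warnings.filterwarnings('ignore')

# take an explicit copy of the first 100 rows so that adding a column
# does not write to a slice of df
df_slice = df[0:100].copy()
df_slice['score'] = ''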

In [28]:
print(df_slice.shape)
df_slice

(100, 3)


                                                corpus  target  score
0    "been feeling bad" last 2 weeks & switched BP ...       N
1    "can't walk", reports onset at 0830 am. orient...       Y
2    "dehydration" Chest hurts, hips hurt, cramps P...       Y
3    "gout flare up" L arm swelling x 1 week. denie...       Y
4    "heart racing,"dyspnea, and orthopnea that has...       N
...                                                ...     ...    ...
95   gout flare in left foot, right 5th finger, and...       Y
96          gout flare up - out of meds pmhx: DM, gout       Y
97   gout flare up that started yesterday in left w...       Y
98   gout flare up to both feet, unable to ambulate...       Y


### Huggingface transformer "zero-shot"
The Huggingface transformer 'zero-shot' classification pipeline is used to classify the patient's chief complaint. For each sentence, the transformer gives a 'score': the likelihood that the topic is 'gout'.

We are looking for scores that are higher for target = Y and lower for target = N, so that a cut-off score can separate the two.
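
One way to judge a candidate cut-off is to count how many Y and N complaints fall on each side of it. A minimal sketch, assuming a numeric column that holds the 'gout' score for each complaint; the column name `gout_score`, the helper `evaluate_cutoff`, and the toy values are all hypothetical.

```python
import pandas as pd

def evaluate_cutoff(frame, cutoff):
    """Return (sensitivity, specificity) of `gout_score >= cutoff` against target."""
    predicted_yes = frame["gout_score"] >= cutoff
    actual_yes = frame["target"] == "Y"
    actual_no = frame["target"] == "N"
    # Share of true Y complaints that score at or above the cut-off.
    sensitivity = (predicted_yes & actual_yes).sum() / actual_yes.sum()
    # Share of true N complaints that score below the cut-off.
    specificity = (~predicted_yes & actual_no).sum() / actual_no.sum()
    return float(sensitivity), float(specificity)

toy = pd.DataFrame({
    "target":     ["Y", "Y", "N", "N"],
    "gout_score": [0.95, 0.40, 0.70, 0.10],
})
print(evaluate_cutoff(toy, cutoff=0.5))
```

Rows with target 'U' count toward neither figure in this sketch, since they are neither Y nor N.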

In [29]:
import transformers
from transformers import pipeline

# instantiate the zero-shot classification pipeline
zero = pipeline("zero-shot-classification")

In [30]:
import warnings
warnings.filterwarnings('ignore')

# feed each line of the corpus and collect the resulting scores for each line
for i in range(100):
    collecting = zero(df_slice['corpus'][i], candidate_labels=["gout", "not gout"])
    # .at writes a single cell directly and avoids chained assignment
    df_slice.at[i, 'score'] = list(collecting['scores'])

In [31]:
# results
df_slice

                                                corpus  target                                        score
0    "been feeling bad" last 2 weeks & switched BP ...       N   [0.9907973408699036, 0.009202688001096249]
1    "can't walk", reports onset at 0830 am. orient...       Y    [0.9805265069007874, 0.01947350613772869]
2    "dehydration" Chest hurts, hips hurt, cramps P...       Y   [0.9907540082931519, 0.009246038272976875]
3    "gout flare up" L arm swelling x 1 week. denie...       Y   [0.9959062337875366, 0.004093728959560394]
4    "heart racing,"dyspnea, and orthopnea that has...       N    [0.9372850060462952, 0.06271496415138245]
...                                                ...     ...                                          ...
95   gout flare in left foot, right 5th finger, and...       Y    [0.997671365737915, 0.002328585833311081]
96          gout flare up - out of meds pmhx: DM, gout       Y   [0.9974185228347778, 0.002581487176939845]
97   gout flare up that started yesterday in left w...       Y  [0.9982471466064453, 0.0017528717871755362]
98   gout flare up to both feet, unable to ambulate...       Y  [0.9961241483688354, 0.0038758094888180494]


In [32]:
# view the scores for target = Y
print(df_slice.sort_values(by='target', ascending = False).head(10))

                                               corpus target  \
50  C/O LLE swelling/stiff x2d and radiating into ...      Y   
58  Called back to ED for positive blood cultures ...      Y   
30  bilateral feet swelling and pain. ambulatory, ...      Y   
31  bilateral knee swelling Left sided chest pain,...      Y   
32  bilateral leg and feet pain. weeping fluid. ha...      Y   
48  C/o left foot and knee pain since Tuesday "I t...      Y   
1   "can't walk", reports onset at 0830 am. orient...      Y   
53  c/o R big toe hurting from gout and headaches ...      Y   
55  C/O sudden onset nontraumatic R knee pain and ...      Y   
62  Chest pain and legs burning and joint pain x 1...      Y   

                                          score  
50   [0.9898225665092468, 0.010177397169172764]  
58  [0.9950408339500427, 0.0049592056311666965]  
30   [0.9724188446998596, 0.027581194415688515]  
31   [0.9617189764976501, 0.038281094282865524]  
32    [0.9339268803596497, 0.0660731568932533

The scores for target = Y look promising. However, the target = N (negative) complaints shown below also include many scores above 0.90.

In [33]:
# view the scores for target = N
print(df_slice.sort_values(by='target', ascending = True).head(10))

                                               corpus target  \
0   "been feeling bad" last 2 weeks & switched BP ...      N   
66  Chest tightness with SOB, weakness x1 week;  P...      N   
64  chest pain x 9 hours- sharp right sided pain P...      N   
63  chest pain radiating to R side x2 1/2 hrs, dia...      N   
61  chest and back pain that has been going on for...      N   
60  chest and abd pain, recently dx w/ flu, given ...      N   
57  call to EMS for heart palpitations, possible S...      N   
56  C/o vaginal and buttock pressure with N/V/D x ...      N   
67  Chronic neck, shoulder and back pain self plac...      N   
54  c/o SOB, Aching and congestion since 0827. PMH...      N   

                                         score  
0   [0.9907973408699036, 0.009202688001096249]  
66   [0.9549756646156311, 0.04502439126372337]  
64  [0.9835835695266724, 0.016416503116488457]  
63  [0.9958539009094238, 0.004146095830947161]  
61  [0.9624640941619873, 0.037535883486270905]  
6
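
The zero-shot pipeline returns its `labels` sorted from most to least likely, with `scores` in the same order, so the first score belongs to whichever label ranked highest rather than always to 'gout'. That would explain why the first number is above 0.9 for both Y and N complaints. A minimal sketch of looking the score up by label instead of by position; `label_score` is a hypothetical helper, and the result dictionary is a made-up example in the pipeline's output shape.

```python
def label_score(result, label):
    """Return the score the pipeline assigned to `label`, regardless of rank."""
    return dict(zip(result["labels"], result["scores"]))[label]

# Made-up result in the shape returned by the zero-shot pipeline:
# 'labels' is ordered by descending score, so 'gout' is second here.
example = {
    "sequence": "chest pain x 9 hours",
    "labels": ["not gout", "gout"],
    "scores": [0.98, 0.02],
}
print(label_score(example, "gout"))
```

In the scoring loop, the stored value could then be `label_score(collecting, 'gout')`, a single number that is comparable across complaints and works the same way with any number of candidate labels.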

In [42]:
collect = []
df_slice = df[0:100]

for i in range(0,100,1):
    collecting = zero(df_slice['corpus'][i],candidate_labels=["gout", 'pregnant', 'injury', 'flu'])
    collect.append(collecting)

In [44]:
# a list keeps the column order fixed (a set has no defined order)
df2 = pd.DataFrame(columns=['corpus', 'target', 'gout', 'pregnant', 'injury', 'flu'])

scores = list(map(lambda x: x["scores"], collect))

In [45]:
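# NOTE: each sublist is in the order the pipeline returned it (highest score
# first), which is not necessarily the order of candidate_labels
for i, sublist in enumerate(scores):
    df2.loc[i] = {'corpus': df_slice['corpus'][i],
                  'target': df_slice['target'][i],
                  'gout': sublist[0],
                  'pregnant': sublist[1],
                  'injury': sublist[2],
                  'flu': sublist[3]}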

In [47]:
print(df2.sort_values(by='target', ascending=False))

    pregnant                                             corpus    injury  \
50  0.388827  C/O LLE swelling/stiff x2d and radiating into ...  0.005782   
58  0.245121  Called back to ED for positive blood cultures ...  0.002637   
30  0.153973  bilateral feet swelling and pain. ambulatory, ...  0.004787   
31  0.363073  bilateral knee swelling Left sided chest pain,...  0.012380   
32  0.239244  bilateral leg and feet pain. weeping fluid. ha...  0.014725   
..       ...                                                ...       ...   
46  0.286967  c/o HA & elevated BP since this morning PMHx: ...  0.003002   
45  0.114937                    c/o fevers, HA, PMHx: gout, HTN  0.037058   
44  0.051859  c/o fatigue x2 days and wanting to get assesse...  0.004215   
43  0.255679  C/o depression and anxiety with SI. Admitted t...  0.037422   
0   0.068769  "been feeling bad" last 2 weeks & switched BP ...  0.004581   

        gout       flu target  
50  0.601594  0.003797      Y  
58  0.75144

### Statistical score that the corpus is talking about 'gout'

In [57]:
from transformers import pipeline
nlp = pipeline("question-answering")

In [104]:
# ask the same question of the first 10 complaints, passing each complaint as the context
for i in range(10):
    context = df['corpus'][i]
    print(nlp(question='Is this gout?', context=context))

{'score': 0.05127223953604698, 'start': 70, 'end': 77, 'answer': 'worried'}
{'score': 0.31441476941108704, 'start': 179, 'end': 195, 'answer': 'gout - pmhx: CVA'}
{'score': 0.09854094684123993, 'start': 104, 'end': 152, 'answer': 'thinks he has a gout flair up knee and foot pain'}
{'score': 0.2828519344329834, 'start': 41, 'end': 61, 'answer': 'denies any other pmh'}
{'score': 0.17536409199237823, 'start': 118, 'end': 122, 'answer': 'gout'}
{'score': 0.05314946547150612, 'start': 1, 'end': 25, 'answer': 'I started breathing hard'}
{'score': 0.11026431620121002, 'start': 33, 'end': 70, 'answer': 'L wrist pain & swelling since 0838 AM'}
{'score': 0.1614694446325302, 'start': 68, 'end': 88, 'answer': 'having gout flare up'}
{'score': 0.30471810698509216, 'start': 86, 'end': 95, 'answer': 'pmh- gout'}
{'score': 0.08825334906578064, 'start': 1, 'end': 22, 'answer': 'my gout is hurting me'}


In [105]:
new_results = []
for i in range(0,9,1):
    question = df['corpus'][i]
    new_results.append(nlp(question=question, context='gout?'))

In [106]:
scores = list(map(lambda x: x["score"], new_results))

In [107]:
# copy the first nine rows so the new column is added to an independent frame
dfslice = df.loc[0:8, :].copy()
dfslice['scores'] = scores


In [108]:
dfslice.sort_values(by='target')

                                              corpus  target    scores
0  "been feeling bad" last 2 weeks & switched BP ...       N  0.323374
4  "heart racing,"dyspnea, and orthopnea that has...       N  0.745537
5   "I started breathing hard" hx- htn, gout, anx...       N  0.314415
1  "can't walk", reports onset at 0830 am. orient...       Y  0.547898
2  "dehydration" Chest hurts, hips hurt, cramps P...       Y  0.277770
3  "gout flare up" L arm swelling x 1 week. denie...       Y  0.123926
6  "I think I have a gout flare up" L wrist pain ...       Y  0.244861
7  "I want to see if I have an infection" pt vagu...       Y  0.160254
8  "My gout done flared up on me", c/o R ankle, L...       Y  0.198742
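
The question-answering pipeline is extractive: it returns a span copied from `context` together with a confidence in that span, so its `score` is not a probability that the complaint describes gout. A minimal sketch of collecting answers with the complaint passed as `context` and a fixed question; `collect_answers` is a hypothetical helper, and `qa` stands for any callable that accepts `question=` and `context=` keywords the way the pipeline does.

```python
import pandas as pd

def collect_answers(qa, complaints, question="Is this gout?"):
    """Run `qa` once per complaint and return answers and scores as a DataFrame."""
    rows = []
    for text in complaints:
        result = qa(question=question, context=text)
        rows.append({"corpus": text,
                     "answer": result["answer"],
                     "score": result["score"]})
    return pd.DataFrame(rows)

# Stand-in for the real pipeline so the sketch runs without a model download.
def fake_qa(question, context):
    return {"score": 0.5, "start": 0, "end": 4, "answer": context[:4]}

print(collect_answers(fake_qa, ["gout flare up", "chest pain"]))
```

With the real model this could be called as `collect_answers(nlp, df['corpus'][:10])`, which keeps each extracted answer next to its score for review.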


### Text Generator

In [None]:
generator = pipeline("text-generation", model="distilgpt2")
generator(
    df['corpus'][0],
    max_length=30,
    num_return_sequences=2,
)