# Data Cleaning - Gout Emergency Department Chief Complaint
### The goal of this project is to answer the question: <i>Is the patient potentially suffering from Gout?</i>

## Scope

The scope of this project is limited to corpora collected in the Deep South.  The population from which they were derived is 54% female and 46% male; 55% Black, 40% White, 2% Hispanic, and 1% Asian. The age distribution is 5% aged 1-20 years, 35% aged 21-40 years, 35% aged 41-60 years, 20% aged 61-80 years, and 5% aged 81-100 years.

## Data

The data is extracted in CSV format from the MIMIC-III (Medical Information Mart for Intensive Care III) database.  Details can be found at https://physionet.org/content/emer-complaint-gout/1.0/.  Access to the database may be requested at https://mimic.physionet.org/gettingstarted/access/.

The data provided by the MIMIC database consists of 2 corpora of free text collected by the triage nurse and recorded as the "Chief Complaint".  Each complaint is up to 282 characters long and was collected in 2019 or 2020 at an academic medical center in the Deep South.  The 2019 corpus, "GOUT-CC-2019-CORPUS", consists of 300 chief complaints selected by the presence of the keyword "gout". The 2020 corpus, "GOUT-CC-2020-CORPUS", contains 8037 chief complaints collected from a single month in 2020.

**Import Data**

In [1]:
import pandas as pd

syn2019 = pd.read_csv('Data/GOUT-CC-2019-CORPUS-SYNTHETIC.csv')
syn2020 = pd.read_csv('Data/GOUT-CC-2020-CORPUS-SYNTHETIC.csv')


**Data Description**
* 2 csv files
    * 2019 : 300 records
    * 2020 : 8037 records
    * Identical layouts and formats: all text, 3 columns
    <br><br>
* 3 Columns:  ["Chief Complaint", "Predict", "Consensus"]
    * <b>Chief Complaint:</b> 
        * text format
        * up to 282 Chars
        * nurse recorded patient complaint
    * <b>Predict:</b> 
        * text format
        * single char ('-','U','Y','N')
        * prediction of Gout by the ER Physician
    * <b>Consensus:</b> 
        * text format
        * single char ('-','U','Y','N')
        * determination of Gout by the Endocrinologist
    <br>
* Label codes (Predict and Consensus):

          - : Null
          U : Unknown
          Y : Yes
          N : No
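
The single-character codes can be made explicit with a small lookup. The following is a minimal sketch, not part of the original notebook: `LABEL_NAMES`, `decode_label`, and `unexpected_codes` are hypothetical names, and reading `N` as "No" is an assumption.

```python
# Hypothetical lookup for the label codes described above.
# Assumption: 'N' means "No" (gout not indicated).
LABEL_NAMES = {"-": "Null", "U": "Unknown", "Y": "Yes", "N": "No"}


def decode_label(code):
    """Return the readable name for a single-character label code."""
    try:
        return LABEL_NAMES[code]
    except KeyError:
        raise ValueError(f"unexpected label code: {code!r}") from None


def unexpected_codes(codes):
    """Return the sorted codes that are not one of the four known values."""
    return sorted(set(codes) - set(LABEL_NAMES))


print(decode_label("Y"))
print(unexpected_codes(["N", "Y", "-", "n", "U"]))
```

On a loaded file, `syn2019['Predict'].map(LABEL_NAMES)` would apply the same lookup to a whole column.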

In [2]:
print(syn2019.head())

                                     Chief Complaint Predict Consensus
0  "been feeling bad" last 2 weeks & switched BP ...       N         -
1  "can't walk", reports onset at 0830 am. orient...       Y         N
2  "dehydration" Chest hurts, hips hurt, cramps P...       Y         Y
3  "gout flare up" L arm swelling x 1 week. denie...       Y         Y
4  "heart racing,"dyspnea, and orthopnea that has...       N         -


In [3]:
print(syn2020.head())

                                     Chief Complaint Predict Consensus
0  "I dont know whats going on with my head, its ...       N         -
1  "i've been depressed for a few weeks now, i'm ...       N         -
2  Altercation while making arrest, c/o R hand pa...       N         N
3  Cut on L upper thigh wtih saw. Bleeding contro...       N         N
4   Dysuria x1 week. hx: hysterectomy, gerd, bipolar       N         -


**Combine the 2 files**

In [4]:
# Combine the files into 1 dataframe
df = pd.concat([syn2019, syn2020], axis=0).reset_index(drop=True)
print(df.shape)

(8437, 3)


**Review records for null value '-' in the files**

In [5]:
print(df['Predict'].value_counts(sort=False))

U     156
N    8168
-       2
Y     111
Name: Predict, dtype: int64


In [6]:
print(df['Consensus'].value_counts(sort=False))

U      16
N     350
-    7976
Y      95
Name: Consensus, dtype: int64
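
The two outputs above imply that only a small fraction of records carry any Consensus value. A short sketch of that arithmetic, with the counts copied by hand from the outputs (the variable names are illustrative):

```python
# Counts copied from the two value_counts() outputs above.
predict_counts = {"U": 156, "N": 8168, "-": 2, "Y": 111}
consensus_counts = {"U": 16, "N": 350, "-": 7976, "Y": 95}

total = sum(consensus_counts.values())
# Both columns describe the same rows, so the totals must agree.
assert total == sum(predict_counts.values())

labeled = total - consensus_counts["-"]      # rows with any Consensus value
share_labeled = labeled / total              # fraction of all rows
share_yes = consensus_counts["Y"] / labeled  # 'Y' among rows with a Consensus value

print(total, labeled)
print(f"{share_labeled:.1%} of rows have a Consensus value")
print(f"{share_yes:.1%} of those are 'Y'")
```

So 461 of the 8437 combined records have a Consensus value, which is worth keeping in mind if Consensus is later used as the target label.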


**Two records have the null value '-' in both the Predict and Consensus columns. They carry no label information, so they are removed during cleaning.**

In [7]:
print( df[(df.Consensus == '-') & (df.Predict == '-')])

                                        Chief Complaint Predict Consensus
7799  Right lower back pain that radiates down leg t...       -         -
7857  pain to posterior upper leg x 3 days, seen at ...       -         -


## Cleaning

**Cleaning steps**

* Remove records that contain null values in both the Predict and Consensus columns.
* Change all characters to lowercase.
* Remove punctuation.
* Remove digits and collapse the remaining whitespace.

**Remove records with double 'null' values, records with '-' in both Consensus and Predict.**

In [8]:
df = df[(df.Consensus != '-') | (df.Predict != '-')].copy()  # copy so later column assignment does not act on a view
print(df.shape)

(8435, 3)
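
The filter keeps a row when at least one of the two labels is present, which is the negation of "both are null". Filtering also leaves the original index labels in place, which is why later output can show an index label larger than the row count. A minimal sketch on a made-up four-row frame (the `toy` data and variable names are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: row 1 has '-' in both label columns.
toy = pd.DataFrame({
    "Chief Complaint": ["gout flare", "back pain", "knee pain", "rash"],
    "Predict": ["Y", "-", "N", "-"],
    "Consensus": ["Y", "-", "-", "N"],
})

both_null = (toy["Predict"] == "-") & (toy["Consensus"] == "-")
either_labeled = (toy["Consensus"] != "-") | (toy["Predict"] != "-")

# The two masks select the same rows (De Morgan's law).
print((~both_null).tolist() == either_labeled.tolist())

kept = toy[either_labeled]
print(kept.index.tolist())        # original labels survive, leaving a gap

kept_reset = kept.reset_index(drop=True)
print(kept_reset.index.tolist())  # contiguous labels again
```

Calling `reset_index(drop=True)` after the filter is optional here; it only matters if the rows are later aligned by position with another object such as the document-term matrix.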


**Convert to lowercase, remove punctuation and digits**

In [9]:
import re
import string

def clean_text(text):
    text = text.lower()                                              # change all chars to lowercase
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # delete punctuation characters
    text = re.sub(r"(\d|\W)+", " ", text)                            # replace runs of digits/non-word chars with one space
    return text


df['Chief Complaint'] = df['Chief Complaint'].apply(clean_text)
df

Unnamed: 0,Chief Complaint,Predict,Consensus
0,been feeling bad last weeks switched bp medica...,N,-
1,cant walk reports onset at am oriented x aorti...,Y,N
2,dehydration chest hurts hips hurt cramps pmh h...,Y,Y
3,gout flare up l arm swelling x week denies any...,Y,Y
4,heart racingdyspnea and orthopnea that has bee...,N,-
...,...,...,...
8432,stepped on a nail at home with right foot pain...,N,N
8433,i was having a breakdown rt stress and depres...,N,-
8434,i tried to jump in front of a car pt states sh...,N,-
8435,abdominal pain x week denies pmh,N,-
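
Because punctuation is deleted before whitespace is normalised, words separated only by punctuation are fused, as in `racingdyspnea` in row 4 above. The sketch below reproduces the notebook's function and compares it with a hypothetical variant, `clean_text_spaced`, that replaces punctuation with a space instead. The sample string is made up, and the variant is an illustration rather than the author's method.

```python
import re
import string

PUNCT_CLASS = '[%s]' % re.escape(string.punctuation)


def clean_text(text):
    """Same steps as the notebook: lowercase, delete punctuation, drop digits."""
    text = text.lower()
    text = re.sub(PUNCT_CLASS, '', text)
    text = re.sub(r"(\d|\W)+", " ", text)
    return text


def clean_text_spaced(text):
    """Hypothetical variant: punctuation becomes a space, so neighbours stay separate."""
    text = text.lower()
    text = re.sub(PUNCT_CLASS, ' ', text)
    text = re.sub(r"(\d|\W)+", " ", text)
    return text


sample = 'Heart racing,"dyspnea x 2 days'
print(clean_text(sample))         # heart racingdyspnea x days
print(clean_text_spaced(sample))  # heart racing dyspnea x days
```

The variant has its own cost: contractions are split, so `can't` becomes `can t` rather than `cant`. Which behaviour is preferable depends on how the tokens are used downstream.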


**Below we create a Corpus for analysis and a Document-Term Matrix**

In [10]:
matrix_data = df.copy(deep=True)

In [11]:
import nltk
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(r"[\w']+")  # raw string avoids an invalid-escape warning for \w

In [12]:
df['Chief Complaint'] = df['Chief Complaint'].apply(lambda x: tokenizer.tokenize(x.lower()))

In [13]:
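from nltk.corpus import stopwords

# Build the stop-word set once instead of reloading the list for every token.
# Requires the NLTK 'stopwords' corpus: nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    words = [w for w in text if w not in stop_words]
    return words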

In [14]:
df['Chief Complaint'] = df['Chief Complaint'].apply(lambda x : remove_stopwords(x))
df['Chief Complaint'].head()

0    [feeling, bad, last, weeks, switched, bp, medi...
1    [cant, walk, reports, onset, oriented, x, aort...
2    [dehydration, chest, hurts, hips, hurt, cramps...
3    [gout, flare, l, arm, swelling, x, week, denie...
4    [heart, racingdyspnea, orthopnea, getting, wor...
Name: Chief Complaint, dtype: object

In [15]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def word_lemmatizer(text):
    lem_text = [lemmatizer.lemmatize(i) for i in text]
    return lem_text

In [16]:
df['Chief Complaint'] = df['Chief Complaint'].apply(word_lemmatizer)  # assign back so the lemmas are kept
df['Chief Complaint']

0       [feeling, bad, last, week, switched, bp, medic...
1       [cant, walk, report, onset, oriented, x, aorti...
2       [dehydration, chest, hurt, hip, hurt, cramp, p...
3       [gout, flare, l, arm, swelling, x, week, denie...
4       [heart, racingdyspnea, orthopnea, getting, wor...
                              ...                        
8432    [stepped, nail, home, right, foot, painful, di...
8433                  [breakdown, rt, stress, depression]
8434    [tried, jump, front, car, pt, state, psych, me...
8435              [abdominal, pain, x, week, denies, pmh]
8436    [rashsores, across, body, infection, ro, left,...
Name: Chief Complaint, Length: 8435, dtype: object

**Pickle the prepared dataframe for analysis**

In [17]:
# Pickle the clean dataframe (DataFrame.to_pickle needs no separate pickle import)
df.to_pickle("df.pkl")

**Create Document-Term Matrix**

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 1))
vectorized = vectorizer.fit_transform(matrix_data['Chief Complaint'])
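# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
vectored_data = pd.DataFrame.sparse.from_spmatrix(vectorized, columns=vectorizer.get_feature_names_out())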

In [19]:
vectored_data

Unnamed: 0,aa,aaa,aain,aao,aaox,ab,abandon,abcess,abcessed,abcessess,...,zofran,zoloft,zoned,zpac,zpack,zpak,zquil,zyprexa,zyrtec,zyrtecd
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8430,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8431,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8432,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8433,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
# Pickle for Analysis
vectored_data.to_pickle("vectored_data.pkl")
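
To make the shape of the matrix concrete, here is a minimal pure-Python sketch of what a unigram document-term matrix contains. It is not scikit-learn's implementation. It only mimics `CountVectorizer`'s default token pattern, which keeps tokens of two or more word characters, so single-letter tokens such as `x` or `l` do not become columns. The two sample documents are abbreviated from the cleaned output above, and all variable names are illustrative.

```python
import re
from collections import Counter

# Two sample documents, abbreviated from the cleaned chief complaints.
docs = [
    "gout flare up l arm swelling x week",
    "abdominal pain x week denies pmh",
]

# Mimics CountVectorizer's default token_pattern: two or more word characters.
token_pattern = re.compile(r"\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# One column per distinct term, in alphabetical order.
vocabulary = sorted({term for tokens in tokenized for term in tokens})

# One row per document, holding the count of each vocabulary term.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

Each row corresponds to one chief complaint and each column to one term, which is the layout shown in the `vectored_data` output above, only at full vocabulary size.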