# HAI
[Hospital-acquired infections](https://en.wikipedia.org/wiki/Hospital-acquired_infection#:~:text=A%20hospital-acquired%20infection%20HAI,or%20other%20health%20care%20facility), or HAIs, are infections that develop during a stay at a hospital or other healthcare facility and that showed no evidence of being present when the patient was admitted. The CDC estimates that, on any given day, about [1 in 31](https://www.cdc.gov/hai/data/index.html) hospital patients has at least one HAI.

We'll explore HAIs here and look at the difference between demographic and patient event data. Using different machine learning tools, we hope to answer the question "Which patients seem to be most at risk of contracting an HAI?"

The dataset we're using is a sample from the MIMIC-III....

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import os.path

microbiology_events = pd.read_csv("../data/raw/mimic-iii-demo/MICROBIOLOGYEVENTS.csv")
chart_events = pd.read_csv("../data/raw/mimic-iii-demo/CHARTEVENTS.csv")
admission = pd.read_csv("../data/raw/mimic-iii-demo/ADMISSIONS.csv")

date_cols = ['dob', 'dod']
patient = pd.read_csv("../data/raw/mimic-iii-demo/PATIENTS.csv", parse_dates=date_cols)



In [2]:
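# Keep only the identifiers and the organism name from the culture results
mrsa_positive_columns = ['subject_id', 'hadm_id', 'org_name']
mrsa_pt = microbiology_events.loc[:, mrsa_positive_columns].copy()

# Flag cultures that grew MRSA ('1') versus everything else ('0')
mrsa_pt['mrsa_positive'] = np.where(mrsa_pt['org_name'] == 'POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS', '1', '0')

# Keep one row per positive (patient, admission) pair
mrsa_pt = mrsa_pt[mrsa_pt['mrsa_positive'] != '0'].drop_duplicates()
mrsa_pt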

| | subject_id | hadm_id | org_name | mrsa_positive |
|---|---|---|---|---|
| 922 | 40204 | 175237 | POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS | 1 |
| 955 | 40310 | 186361 | POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS | 1 |
| 1055 | 40595 | 116518 | POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS | 1 |
| 1086 | 41795 | 138132 | POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS | 1 |
| 1190 | 41914 | 101361 | POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS | 1 |
| 1247 | 41976 | 173269 | POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS | 1 |
| 1259 | 41976 | 176016 | POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS | 1 |
| 1664 | 42135 | 102203 | POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS | 1 |
| 1811 | 42367 | 139932 | POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS | 1 |
| 1943 | 44154 | 174245 | POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS | 1 |
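
The flag built above is a string column with one row per positive culture. A possible variation, sketched here on made-up rows, stores the flag as an integer and collapses the result to one row per admission. The `flag_mrsa` helper and the toy values are illustrative and not part of the notebook.

```python
import pandas as pd

MRSA_LABEL = "POSITIVE FOR METHICILLIN RESISTANT STAPH AUREUS"

def flag_mrsa(events: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (subject_id, hadm_id) with an integer MRSA flag."""
    flagged = events[["subject_id", "hadm_id", "org_name"]].copy()
    # eq() treats a missing org_name as "not equal", so NaN becomes 0
    flagged["mrsa_positive"] = flagged["org_name"].eq(MRSA_LABEL).astype(int)
    # max() marks an admission positive if any of its cultures was positive
    return (
        flagged.groupby(["subject_id", "hadm_id"], as_index=False)["mrsa_positive"]
        .max()
    )

# Toy culture results (hypothetical values)
toy_events = pd.DataFrame({
    "subject_id": [1, 1, 2, 3],
    "hadm_id": [10, 10, 20, 30],
    "org_name": [MRSA_LABEL, "ESCHERICHIA COLI", None, "ESCHERICHIA COLI"],
})
mrsa_flags = flag_mrsa(toy_events)
print(mrsa_flags)
```

Note that `groupby` drops rows whose key is missing by default, so cultures without a `hadm_id` would not appear in the result.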


In [3]:
#demographics["age"] = ((demographics["dod"]-demographics["dob"]).dt.days) //365
#demographics['dod'] = pd.to_datetime(demographics["dod"], infer_datetime_format=True).dt.date
#demographics.dtypes
#demographics["age"] = demographics["dod"]-demographics["dob"]
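
The disabled lines above compute `dod - dob`, which would give age at death rather than age during the stay. If age at admission is what the analysis needs, one option is to compare calendar fields instead of subtracting timestamps. The sketch below assumes both columns are already parsed as datetimes; the `age_at_admission` helper and the toy dates are illustrative. Avoiding the subtraction also sidesteps the overflow that pandas can raise when two dates are roughly 300 years apart, which can happen with the shifted birth dates MIMIC-III uses for its oldest patients.

```python
import pandas as pd

def age_at_admission(dob: pd.Series, admittime: pd.Series) -> pd.Series:
    """Whole years between dob and admittime, computed from calendar fields."""
    years = admittime.dt.year - dob.dt.year
    # Subtract one year when the birthday has not yet occurred in the admission year
    before_birthday = (
        (admittime.dt.month < dob.dt.month)
        | ((admittime.dt.month == dob.dt.month) & (admittime.dt.day < dob.dt.day))
    )
    return years - before_birthday.astype(int)

# Toy dates (hypothetical values); the last row mimics a shifted birth date
toy = pd.DataFrame({
    "dob": pd.to_datetime(["2100-06-15", "2100-06-15", "1850-01-01"]),
    "admittime": pd.to_datetime(["2150-06-14", "2150-06-15", "2150-01-01"]),
})
toy["age"] = age_at_admission(toy["dob"], toy["admittime"])
print(toy)
```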

In [4]:
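# Attach patient demographics to each admission
pt_info = pd.merge(admission, patient, on='subject_id')
# Left join keeps admissions without a positive MRSA culture (mrsa_positive is NaN for them)
pt_info = pd.merge(pt_info, mrsa_pt, on='subject_id', how='left')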

pt_info.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 144 entries, 0 to 143
Data columns (total 29 columns):
row_id_x                144 non-null int64
subject_id              144 non-null int64
hadm_id_x               144 non-null int64
admittime               144 non-null object
dischtime               144 non-null object
deathtime               40 non-null object
admission_type          144 non-null object
admission_location      144 non-null object
discharge_location      144 non-null object
insurance               144 non-null object
language                96 non-null object
religion                143 non-null object
marital_status          128 non-null object
ethnicity               144 non-null object
edregtime               107 non-null object
edouttime               107 non-null object
diagnosis               144 non-null object
hospital_expire_flag    144 non-null int64
has_chartevents_data    144 non-null int64
row_id_y                144 non-null int64
gender                 
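
Because `mrsa_pt` is joined on `subject_id` alone, the positive flag attaches to every admission of a patient, and rows multiply when a patient has more than one positive admission (the `hadm_id_x` column in the output is a side effect of both frames carrying `hadm_id`). If the flag is meant to describe a single admission, joining on both keys is one alternative. This is a sketch on made-up frames, not a change to the notebook's results.

```python
import pandas as pd

# Toy frames (hypothetical values)
admissions = pd.DataFrame({
    "subject_id": [1, 1, 2],
    "hadm_id": [10, 11, 20],
    "admission_type": ["EMERGENCY", "ELECTIVE", "EMERGENCY"],
})
mrsa = pd.DataFrame({
    "subject_id": [1],
    "hadm_id": [11],
    "mrsa_positive": ["1"],
})

# Join on both keys so the flag attaches only to the admission it came from
merged = admissions.merge(
    mrsa,
    on=["subject_id", "hadm_id"],
    how="left",
    validate="many_to_one",  # raises MergeError if mrsa has duplicate keys
)
# Admissions without a positive culture come back as NaN; make that explicit
merged["mrsa_positive"] = merged["mrsa_positive"].fillna("0")
print(merged)
```

The `validate` argument is a cheap guard: the merge fails loudly instead of silently adding rows.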

We'll first make a data frame of demographic and admission factors: insurance, gender, ethnicity, admission type, and so on.

In [5]:
pt_info_columns = [
    'insurance', 'gender', 'ethnicity', 'admission_type',
    'admission_location', 'diagnosis', 'mrsa_positive',
]  # 'age' is not included yet
pt_info = pt_info.loc[:, pt_info_columns]
pt_info

| | insurance | gender | ethnicity | admission_type | admission_location | diagnosis | mrsa_positive |
|---|---|---|---|---|---|---|---|
| 0 | Medicare | F | BLACK/AFRICAN AMERICAN | EMERGENCY | EMERGENCY ROOM ADMIT | SEPSIS | |
| 1 | Private | F | UNKNOWN/NOT SPECIFIED | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | HEPATITIS B | |
| 2 | Medicare | F | UNKNOWN/NOT SPECIFIED | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | SEPSIS | |
| 3 | Medicare | F | WHITE | EMERGENCY | EMERGENCY ROOM ADMIT | HUMERAL FRACTURE | |
| 4 | Medicare | M | WHITE | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | ALCOHOLIC HEPATITIS | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 139 | Private | M | WHITE | EMERGENCY | EMERGENCY ROOM ADMIT | PERICARDIAL EFFUSION | |
| 140 | Medicare | M | WHITE | EMERGENCY | EMERGENCY ROOM ADMIT | ALTERED MENTAL STATUS | 1 |
| 141 | Medicare | F | BLACK/AFRICAN AMERICAN | EMERGENCY | TRANSFER FROM HOSP/EXTRAM | ACUTE RESPIRATORY DISTRESS SYNDROME;ACUTE RENA... | |
| 142 | Medicare | M | WHITE | EMERGENCY | EMERGENCY ROOM ADMIT | BRADYCARDIA | 1 |
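
With a frame shaped like this, a first look at which groups carry more positive flags is a grouped proportion. The sketch below runs on made-up rows and assumes that a missing `mrsa_positive` value means no positive MRSA culture was recorded; the `mrsa_flag` column name is introduced here for illustration.

```python
import pandas as pd

# Toy rows (hypothetical values) mirroring two of the columns above
toy = pd.DataFrame({
    "insurance": ["Medicare", "Medicare", "Private", "Private", "Medicare"],
    "mrsa_positive": ["1", None, None, None, "1"],
})
# Assumption: a missing flag means no positive MRSA culture was recorded
toy["mrsa_flag"] = toy["mrsa_positive"].eq("1").astype(int)

# Count, number of positives, and share of positives per insurance type
mrsa_by_insurance = (
    toy.groupby("insurance")["mrsa_flag"]
    .agg(n="count", positives="sum", rate="mean")
    .sort_values("rate", ascending=False)
)
print(mrsa_by_insurance)
```

On a sample this small the rates are noisy, so they are better read as prompts for further questions than as risk estimates.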


[Pandas Profiling](https://pypi.org/project/pandas-profiling/) is a useful library for starting an EDA from scratch. It automatically generates an EDA report from any pandas DataFrame.

In [11]:
#Corr matrix patient demographic
import pandas_profiling

pandas_profiling.ProfileReport(pt_info)

ModuleNotFoundError: No module named 'pandas_profiling'
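
The `ModuleNotFoundError` above means the package is not installed in this environment. The project is now published on PyPI as `ydata-profiling`, with `ProfileReport` importable from `ydata_profiling`. Where installing it is not an option, a rough stand-in can be put together with plain pandas. The `quick_profile` helper below is hypothetical and covers only a small part of what a profiling report shows; the toy rows are made up.

```python
import pandas as pd

def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column dtype, missing count, distinct count, and most common value."""
    rows = []
    for col in df.columns:
        counts = df[col].value_counts(dropna=True)
        rows.append({
            "column": col,
            "dtype": str(df[col].dtype),
            "missing": int(df[col].isna().sum()),
            "distinct": int(df[col].nunique(dropna=True)),
            # Most frequent non-missing value, or None for an all-missing column
            "top": counts.index[0] if len(counts) else None,
        })
    return pd.DataFrame(rows).set_index("column")

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "gender": ["F", "F", "M"],
    "mrsa_positive": ["1", None, None],
})
profile = quick_profile(toy)
print(profile)
```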

Next, we'll create a dataframe for patient events and look into this data.

We run into a problem here: some patients have multiple diagnoses, while others have only one. As good data practice, we don't want a single column to hold multiple values, but we also don't want the data to become too wide. We will address this in the [feature engineering notebook](./01-feature-engineering). For now, we just need to be aware of the different features.

In [9]:
pt_info["diagnosis"].value_counts()

SEPSIS                                               15
PNEUMONIA                                            12
FEVER                                                 5
SHORTNESS OF BREATH                                   4
CONGESTIVE HEART FAILURE                              3
                                                     ..
CHRONIC MYELOGENOUS LEUKEMIA;TRANSFUSION REACTION     1
CELLULITIS                                            1
SEIZURE;STATUS EPILEPTICUS                            1
S/P FALL                                              1
VOLVULUS                                              1
Name: diagnosis, Length: 95, dtype: int64
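
Several `diagnosis` values above pack more than one condition into a single string, separated by semicolons. Two common ways to pull them apart are sketched below on made-up rows: a long format via `str.split` plus `explode`, and a wide indicator format via `str.get_dummies`. Which one suits the feature engineering step is left open here; the `hadm_id` values are illustrative.

```python
import pandas as pd

# Toy rows (hypothetical values) with semicolon-separated diagnoses
toy = pd.DataFrame({
    "hadm_id": [10, 20, 30],
    "diagnosis": ["SEPSIS", "SEIZURE;STATUS EPILEPTICUS", "SEPSIS;PNEUMONIA"],
})

# Long format: one row per (admission, diagnosis)
long_dx = (
    toy.assign(diagnosis=toy["diagnosis"].str.split(";"))
    .explode("diagnosis")
    .reset_index(drop=True)
)

# Wide format: one 0/1 column per distinct diagnosis
wide_dx = toy[["hadm_id"]].join(toy["diagnosis"].str.get_dummies(sep=";"))

print(long_dx)
print(wide_dx)
```

The long format keeps the frame narrow at the cost of repeated admissions, while the wide format keeps one row per admission but grows a column for every distinct diagnosis.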

In [10]:
#pt_info.to_csv("../data/interim/pt_info.csv")