# Introduction

The growing use of digital information and electronic health records (EHRs) is creating 'Big Data' for the healthcare industry. Combined with advances in machine learning and big data techniques, this lets healthcare organizations apply predictive modeling to EHRs to drive personalized medicine and improve healthcare quality. For my final project, I wanted to explore extracting free-text data from the physician notes in an EHR to help identify patients suspected of having a life-threatening disease. In particular, I wanted to predict whether an individual is at risk of a stroke based on their social, medical, and family history, so that strokes can be prevented.

# Problem Statement

Build a classification model that predicts a patient's risk of stroke from the free text of their EHR notes.
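
A classifier needs a target label, and this section does not say how "stroke" is defined. As a minimal sketch, one option is to flag ICD-9 categories 430 to 438 (cerebrovascular disease). That range, the helper name `is_stroke_code`, and the toy codes below are assumptions for illustration, not the definition used in this project. The sketch also assumes codes are stored as strings without the decimal point.

```python
import pandas as pd

# Assumption: "stroke" means ICD-9 cerebrovascular disease, categories 430-438.
STROKE_PREFIXES = tuple(str(code) for code in range(430, 439))

def is_stroke_code(icd9_code) -> bool:
    """Return True if the code's leading digits fall in categories 430-438."""
    return str(icd9_code).startswith(STROKE_PREFIXES)

# Made-up example codes, written without the decimal point.
toy = pd.DataFrame({"ICD9_CODE": ["43491", "4019", "431", "V1254"]})
toy["stroke"] = toy["ICD9_CODE"].apply(is_stroke_code).astype(int)
print(toy)
```

A prefix match keeps every subcode of a category together, which is convenient when the codes carry four or five digits.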

# Data

Data was extracted from MIMIC-III (Medical Information Mart for Intensive Care), a large, single-center database comprising information on patients admitted to critical care units at a large tertiary care hospital. The MIMIC-III database includes 26 tables, but for my analysis I focused on three: (1) NOTEEVENTS: nursing and physician notes. (2) DIAGNOSES_ICD: hospital-assigned diagnoses, coded using the International Statistical Classification of Diseases and Related Health Problems (ICD). (3) D_ICD_DIAGNOSES: the definition table for the ICD-9 codes used in the diagnoses table.
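
To make the relationship between the three tables concrete, here is a minimal sketch on made-up rows. The column names follow the ones used in the cleaning code in this notebook, while the values are invented for illustration. Diagnoses link to their definitions through `ICD9_CODE`, and notes link to diagnoses through `SUBJECT_ID` and `HADM_ID`.

```python
import pandas as pd

# Toy stand-ins for the three tables; all values are invented.
notes = pd.DataFrame({
    "SUBJECT_ID": [1, 2],
    "HADM_ID": [100, 200],
    "TEXT": ["note for admission 100", "note for admission 200"],
})
diagnoses = pd.DataFrame({
    "SUBJECT_ID": [1, 2],
    "HADM_ID": [100, 200],
    "SEQ_NUM": [1, 1],
    "ICD9_CODE": ["431", "4019"],
})
definitions = pd.DataFrame({
    "ICD9_CODE": ["431", "4019"],
    "LONG_TITLE": ["Intracerebral hemorrhage", "Unspecified essential hypertension"],
})

# Attach the human-readable title to each diagnosis, then attach diagnoses to notes.
labelled = diagnoses.merge(definitions, on="ICD9_CODE", how="inner")
linked = notes.merge(labelled, on=["SUBJECT_ID", "HADM_ID"], how="inner")
print(linked)
```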

In [1]:
import pandas as pd
import pickle
import re

## Data Cleaning 

In [36]:
class Ehr_df_builder():
    '''
    Loads and cleans three CSVs and merges them into a final EHR dataframe.

    Inputs:
    (1) D_ICD_DIAGNOSES.csv
    (2) DIAGNOSES_ICD.csv
    (3) NOTEEVENTS.csv

    Output:
    A dataframe (self.ehr_df) that contains the merged tables
    '''
    
    def __init__(self):
        
        #runs script
        self.wrapper()
        
    def wrapper(self):
        self.load_data()
        self.clean_text()
        self.remove_null()
        self.int_converter()
        self.dropping_features()
        self.discharge_summary()
        self.final_dataframe()
    
    #Loading ALL the individual csvs 
    def load_data(self):
        '''
        Input:
        D_ICD_DIAGNOSES.csv - Definition table for ICD diagnoses.
        DIAGNOSES_ICD.csv - Contains ICD diagnoses for patients, most notably ICD-9 diagnoses. Codes are generated for billing purposes.
        NOTEEVENTS.csv - Contains all notes for patients from doctor or nurses.
        
        Sets:
        self.noteevents, self.d_icd_diagnosis, self.diagnosis_icd dataframes
        '''
        self.noteevents = pd.read_csv('/Users/matttom/Desktop/Final_project/MIMIC_data/NOTEEVENTS.csv')
        self.d_icd_diagnosis = pd.read_csv('/Users/matttom/Desktop/Final_project/MIMIC_data/D_ICD_DIAGNOSES.csv')
        self.diagnosis_icd = pd.read_csv('/Users/matttom/Desktop/Final_project/MIMIC_data/DIAGNOSES_ICD.csv')

    def clean_text(self):
        #Replaces newlines, strips bracketed spans, and removes characters outside the allowed set
        self.noteevents.TEXT = self.noteevents.TEXT.apply(lambda x: x.replace('\n', ' '))
        self.noteevents.TEXT = self.noteevents.TEXT.apply(lambda x: re.sub(r'\[.*?\]', ' ', x))
        self.noteevents.TEXT = self.noteevents.TEXT.apply(lambda x: re.sub(r'[^a-zA-Z0-9: ‘,]+', ' ', x))
        
    def remove_null(self):
        #Dropping all the null values from HADM_ID
        self.noteevents.dropna(subset=['HADM_ID'],inplace=True)
        
    def int_converter(self):
        #Convert float objects to int
        self.noteevents.HADM_ID = self.noteevents.HADM_ID.astype(int)
        
    def dropping_features(self):
        #Dropping features that are not needed
        self.noteevents = self.noteevents.drop(columns=['ROW_ID','CHARTDATE','CHARTTIME','STORETIME','CGID','DESCRIPTION'])
        self.d_icd_diagnosis.drop(columns=['ROW_ID', 'SHORT_TITLE'],inplace=True)
        self.diagnosis_icd.drop(columns=['ROW_ID'],inplace=True)

        #Only concerned with the top-priority diagnosis for each admission, which is SEQ_NUM == 1.0
        self.diagnosis_icd = self.diagnosis_icd[self.diagnosis_icd.SEQ_NUM == 1.0]
        self.diagnosis_icd = self.diagnosis_icd.reset_index(drop=True)

        #Dropping the SEQ_NUM column as it is no longer needed
        #(reassigning instead of inplace=True avoids a SettingWithCopyWarning on the filtered frame)
        self.diagnosis_icd = self.diagnosis_icd.drop(columns=['SEQ_NUM'])

        #If ISERROR == 1, a physician has flagged that note as an error, so those notes are removed
        self.noteevents = self.noteevents[self.noteevents.ISERROR != 1]
        self.noteevents = self.noteevents.reset_index(drop=True)

        #The ISERROR column is no longer needed once the flagged rows are removed
        self.noteevents = self.noteevents.drop(columns=['ISERROR'])
        
    def discharge_summary(self):
        #Keeping only the notes whose CATEGORY is 'Discharge summary'
        self.noteevents = self.noteevents[self.noteevents.CATEGORY=='Discharge summary']
        
    def final_dataframe(self):
        #Creating the final EHR dataframe by merging all 3 tables: noteevents, d_icd_diagnosis, diagnosis_icd
        self.diagnosis_df = self.d_icd_diagnosis.merge(self.diagnosis_icd, on ='ICD9_CODE',how='inner')
        self.ehr_df = self.noteevents.merge(self.diagnosis_df, on=['SUBJECT_ID','HADM_ID'], how='inner')
        self.ehr_df.reset_index(inplace=True,drop=True)
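
The three cleaning steps in `clean_text` can be tried on a single string before running them over the whole table. The sketch below mirrors those steps in a standalone helper. The name `clean_note` and the sample note are made up for illustration; the bracketed date imitates the de-identification placeholders found in MIMIC-III notes.

```python
import re

def clean_note(text: str) -> str:
    """Apply the same three steps as clean_text to one note."""
    text = text.replace('\n', ' ')                    # flatten line breaks
    text = re.sub(r'\[.*?\]', ' ', text)              # drop bracketed placeholders
    text = re.sub(r'[^a-zA-Z0-9: ‘,]+', ' ', text)    # keep only the allowed characters
    return text

# Invented sample note.
sample = "Admission Date: [**2101-10-20**]\nPt c/o headache; BP 150/90."
cleaned = clean_note(sample)
print(repr(cleaned))
```

Each removed span is replaced by a space, so runs of spaces remain in the result; collapsing them would take an extra step that the original code does not perform.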
        
      

In [31]:
new = Ehr_df_builder()


In [33]:
ehr_df = new.ehr_df

In [34]:
ehr_df.shape

(58800, 6)

In [None]:
#Creating a saved pickle of the final dataframe
with open('EHR_df', 'wb') as f: #change
    pickle.dump(ehr_df, f)         #change 
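
A later notebook can reload the saved dataframe with `pickle.load`. As a minimal sketch, the round trip below uses a small invented dataframe and a temporary directory instead of the real `EHR_df` file, so it runs without the MIMIC data.

```python
import os
import pickle
import tempfile

import pandas as pd

# Invented stand-in for ehr_df.
toy = pd.DataFrame({"HADM_ID": [100, 200], "TEXT": ["note a", "note b"]})

# Write to a temporary location rather than the project directory.
path = os.path.join(tempfile.mkdtemp(), "EHR_df")
with open(path, "wb") as f:
    pickle.dump(toy, f)

# Read it back and confirm nothing changed.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored.equals(toy))
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.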