# Overview of the Data

MIMIC-III (‘Medical Information Mart for Intensive Care’) is a large, single-center database comprising information relating to patients admitted to critical care units at a large tertiary care hospital. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more.   
MIMIC-III is a relational database consisting of 26 tables. This notebook gives an overview of the individual tables.

## The ADMISSIONS table

The ADMISSIONS table gives information regarding a patient’s admission to the hospital. Since each unique hospital visit for a patient is assigned a unique HADM_ID, the ADMISSIONS table can be considered as a definition table for HADM_ID. Information available includes timing information for admission and discharge, demographic information, the source of the admission, and so on.  

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
pd.options.display.float_format = '{:,}'.format

In [None]:
df_admissions = pd.read_csv('data/ADMISSIONS.csv.gz', compression='gzip')

In [None]:
df_admissions.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,ADMITTIME,DISCHTIME,DEATHTIME,ADMISSION_TYPE,ADMISSION_LOCATION,DISCHARGE_LOCATION,INSURANCE,LANGUAGE,RELIGION,MARITAL_STATUS,ETHNICITY,EDREGTIME,EDOUTTIME,DIAGNOSIS,HOSPITAL_EXPIRE_FLAG,HAS_CHARTEVENTS_DATA
0,21,22,165315,2196-04-09 12:26:00,2196-04-10 15:54:00,,EMERGENCY,EMERGENCY ROOM ADMIT,DISC-TRAN CANCER/CHLDRN H,Private,,UNOBTAINABLE,MARRIED,WHITE,2196-04-09 10:06:00,2196-04-09 13:24:00,BENZODIAZEPINE OVERDOSE,0,1
1,22,23,152223,2153-09-03 07:15:00,2153-09-08 19:10:00,,ELECTIVE,PHYS REFERRAL/NORMAL DELI,HOME HEALTH CARE,Medicare,,CATHOLIC,MARRIED,WHITE,,,CORONARY ARTERY DISEASE\CORONARY ARTERY BYPASS...,0,1
2,23,23,124321,2157-10-18 19:34:00,2157-10-25 14:00:00,,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,HOME HEALTH CARE,Medicare,ENGL,CATHOLIC,MARRIED,WHITE,,,BRAIN MASS,0,1
3,24,24,161859,2139-06-06 16:14:00,2139-06-09 12:48:00,,EMERGENCY,TRANSFER FROM HOSP/EXTRAM,HOME,Private,,PROTESTANT QUAKER,SINGLE,WHITE,,,INTERIOR MYOCARDIAL INFARCTION,0,1
4,25,25,129635,2160-11-02 02:06:00,2160-11-05 14:55:00,,EMERGENCY,EMERGENCY ROOM ADMIT,HOME,Private,,UNOBTAINABLE,MARRIED,WHITE,2160-11-02 01:01:00,2160-11-02 04:27:00,ACUTE CORONARY SYNDROME,0,1


In [None]:
df_admissions.shape

(58976, 19)

In [None]:
df_admissions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58976 entries, 0 to 58975
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   ROW_ID                58976 non-null  int64 
 1   SUBJECT_ID            58976 non-null  int64 
 2   HADM_ID               58976 non-null  int64 
 3   ADMITTIME             58976 non-null  object
 4   DISCHTIME             58976 non-null  object
 5   DEATHTIME             5854 non-null   object
 6   ADMISSION_TYPE        58976 non-null  object
 7   ADMISSION_LOCATION    58976 non-null  object
 8   DISCHARGE_LOCATION    58976 non-null  object
 9   INSURANCE             58976 non-null  object
 10  LANGUAGE              33644 non-null  object
 11  RELIGION              58518 non-null  object
 12  MARITAL_STATUS        48848 non-null  object
 13  ETHNICITY             58976 non-null  object
 14  EDREGTIME             30877 non-null  object
 15  EDOUTTIME             30877 non-null

In [None]:
df_admissions.SUBJECT_ID.nunique()

46520

The dataset contains 58976 admissions of 46520 distinct patients.
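
Since there are more admissions than patients, some patients were admitted more than once. A minimal sketch of counting admissions per patient, using only the five `SUBJECT_ID` values visible in the `head()` output above (a toy subset, not the full table):

```python
import pandas as pd

# SUBJECT_IDs of the five rows shown in the head() output (toy subset)
subject_ids = pd.Series([22, 23, 23, 24, 25], name="SUBJECT_ID")

# Number of admissions per patient
admissions_per_patient = subject_ids.value_counts()

n_patients = subject_ids.nunique()
# Patients with more than one admission in this subset
n_multi = int((admissions_per_patient > 1).sum())
print(n_patients, n_multi)
```

On the real data the same calls would be applied to `df_admissions.SUBJECT_ID`.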

|Column|Description|Type|
|:-----|:-----------|:----|
|ROW_ID|row index of the table|int|
|SUBJECT_ID|identifier of the patient|int|
|HADM_ID|identifier of a single hospital admission of a patient, in the range 100000 to 199999|int|
|ADMITTIME|date and time the patient was admitted to the hospital|timestamp|
|DISCHTIME|date and time the patient was discharged from the hospital|timestamp|
|DEATHTIME|time of in-hospital death, if applicable. Only present if the patient died in hospital, and almost always the same as the patient’s DISCHTIME.|timestamp|
|ADMISSION_TYPE|type of the admission: ‘ELECTIVE’, ‘URGENT’, ‘NEWBORN’ or ‘EMERGENCY’. Emergency/urgent indicate unplanned medical care and are often collapsed into a single category in studies. Elective indicates a previously planned hospital admission. Newborn indicates that the HADM_ID pertains to the patient’s birth.|string|
|ADMISSION_LOCATION|location of the patient prior to arriving at the hospital. There are 9 possible values: EMERGENCY ROOM ADMIT, TRANSFER FROM HOSP/EXTRAM, TRANSFER FROM OTHER HEALT, CLINIC REFERRAL/PREMATURE, INFO NOT AVAILABLE, TRANSFER FROM SKILLED NUR, TRSF WITHIN THIS FACILITY, HMO REFERRAL/SICK, PHYS REFERRAL/NORMAL DELI|string|
|DISCHARGE_LOCATION|location to which the patient is discharged|string|
|INSURANCE|health insurance of the patient|string|
|LANGUAGE|native language|string|
|RELIGION|religious affiliation|string|
|MARITAL_STATUS|marital status|string|
|ETHNICITY|ethnicity (not important for our analysis)|string|
|EDREGTIME|time the patient was registered in the emergency department|timestamp|
|EDOUTTIME|time the patient was discharged from the emergency department|timestamp|
|DIAGNOSIS|preliminary, free-text diagnosis for the patient on hospital admission. The diagnosis is usually assigned by the admitting clinician and does not use a systematic ontology.|string|
|HOSPITAL_EXPIRE_FLAG|indicates whether the patient died within the given hospitalization. 1 indicates death in the hospital, and 0 indicates survival to hospital discharge.|int|
|HAS_CHARTEVENTS_DATA|indicates whether the admission has entries in the CHARTEVENTS table. 1 indicates that it has CHARTEVENTS data, 0 that it has none.|int|
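
The timestamp columns arrive as `object` (plain strings) after `read_csv`, as the `info()` output shows. A minimal sketch of converting them and deriving a length of stay, using two rows copied from the `head()` output. The column name `LOS_DAYS` is our own helper and not part of MIMIC-III:

```python
import pandas as pd

# Two rows copied from the head() output (toy subset, not the full table)
toy = pd.DataFrame({
    "HADM_ID": [165315, 152223],
    "ADMITTIME": ["2196-04-09 12:26:00", "2153-09-03 07:15:00"],
    "DISCHTIME": ["2196-04-10 15:54:00", "2153-09-08 19:10:00"],
})

# Parse the string columns into datetime64 values
for col in ["ADMITTIME", "DISCHTIME"]:
    toy[col] = pd.to_datetime(toy[col])

# LOS_DAYS (hypothetical helper column): hospital length of stay in days
toy["LOS_DAYS"] = (toy["DISCHTIME"] - toy["ADMITTIME"]).dt.total_seconds() / 86400
print(toy[["HADM_ID", "LOS_DAYS"]])
```

Alternatively, the `parse_dates` argument of `pd.read_csv` can do the conversion while loading.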

In [None]:
# Drop the newborn admissions so that only adult admissions remain in the dataframe
df_admissions_adults = df_admissions[df_admissions.ADMISSION_TYPE != "NEWBORN"]

In [None]:
df_admissions_adults.shape

(51113, 11)

In [None]:
# How many adult admissions have no entries in CHARTEVENTS?
len(df_admissions_adults[df_admissions_adults.HAS_CHARTEVENTS_DATA == 0])

1492

In [None]:
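# Load the HADM_IDs of the ARDS admissions ("liste_patienten" = patient list).
# list(DataFrame) yields the column labels, so this assumes the IDs sit in the header row of the CSV.
liste_patienten = list(pd.read_csv('data/liste_patienten.csv'))
liste_patienten = [int(x) for x in liste_patienten]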

In [None]:
df_admissions_adults[df_admissions_adults.HAS_CHARTEVENTS_DATA == 0].HADM_ID.isin(liste_patienten).sum()


53

There are 1492 adult admissions without any entry in the CHARTEVENTS table.
Only 53 of these admissions are on our ARDS list.
Because the most important information is found in the CHARTEVENTS table, we have to work without these admissions.
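
A minimal sketch of that exclusion on toy data: keep only admissions that are on the ARDS list and have CHARTEVENTS data. The IDs are invented, and `df_ards_usable` is a hypothetical name:

```python
import pandas as pd

# Toy stand-ins for df_admissions_adults and liste_patienten (invented values)
toy_adm = pd.DataFrame({
    "HADM_ID": [100001, 100002, 100003, 100004],
    "HAS_CHARTEVENTS_DATA": [1, 0, 1, 0],
})
toy_list = [100001, 100002, 100004]

on_list = toy_adm.HADM_ID.isin(toy_list)
has_chart = toy_adm.HAS_CHARTEVENTS_DATA == 1

# df_ards_usable (hypothetical name): ARDS admissions we can keep working with
df_ards_usable = toy_adm[on_list & has_chart]
print(df_ards_usable.HADM_ID.tolist())
```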

**Mortality overall**

In [None]:
len(df_admissions[df_admissions.HOSPITAL_EXPIRE_FLAG == 1])

5854

Of all 58976 admissions, 5854 ended with the patient dying in hospital (about 9.9%).

**Mortality Adults**

In [None]:
len(df_admissions_adults[df_admissions_adults.HOSPITAL_EXPIRE_FLAG == 1])

5792

Of the 51113 adult admissions, 5792 ended with the patient dying in hospital (about 11.3%).

**Mortality Babies**

The mortality rate among newborns is 62 / 7863 ≈ 0.8% (5854 − 5792 = 62 deaths in 58976 − 51113 = 7863 newborn admissions), considerably lower than among the adults.
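
Because `HOSPITAL_EXPIRE_FLAG` is a 0/1 flag, its mean is the share of admissions that ended in death, so these rates can be computed directly. A sketch on invented toy rows (not MIMIC-III values):

```python
import pandas as pd

# Invented toy rows for illustration
toy = pd.DataFrame({
    "ADMISSION_TYPE": ["EMERGENCY", "EMERGENCY", "ELECTIVE", "NEWBORN", "NEWBORN", "URGENT"],
    "HOSPITAL_EXPIRE_FLAG": [1, 0, 0, 0, 0, 1],
})

# Share of admissions ending in in-hospital death, per admission type
rate_by_type = toy.groupby("ADMISSION_TYPE")["HOSPITAL_EXPIRE_FLAG"].mean()

# Adults vs. newborns, mirroring the split used in this notebook
is_newborn = toy.ADMISSION_TYPE == "NEWBORN"
rate_adults = toy.loc[~is_newborn, "HOSPITAL_EXPIRE_FLAG"].mean()
rate_newborn = toy.loc[is_newborn, "HOSPITAL_EXPIRE_FLAG"].mean()
print(rate_by_type)
print(rate_adults, rate_newborn)
```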

## CPTEVENTS

The CPTEVENTS table lists which Current Procedural Terminology (CPT) codes were billed for which patients. This can be useful for determining whether certain procedures have been performed (e.g. ventilation).



Column|Description|Type
-----|:------|:----
COSTCENTER|COSTCENTER is the cost center which billed for the corresponding CPT codes. There are two possible cost centers: **‘ICU’ and ‘Resp’**. ‘Resp’ codes correspond to mechanical or non-invasive ventilation and were billed by the respiratory therapist. ‘ICU’ codes correspond to the procedures billed for by the ICU. **In our ARDS group, 6052 admissions were billed by ‘Resp’.**|object
CHARTDATE|The date at which the procedure occurred. **--> DROP**|object
CPT_CD|CPT_CD contains the original CPT code. The codes that interest us are **94003** and **94002**. The column holds them in both integer and string form. **--> DROP**|object
CPT_NUMBER|**Same as CPT_CD**, but stored as a float, so it is **easier** to work with|float
CPT_SUFFIX|The CPT_SUFFIX column contains the text suffix when the CPT_CD contains non-numeric characters. **--> DROP**|object
TICKET_ID_SEQ|The order of the CPT_CD **--> DROP**|float
SECTIONHEADER|The section headers provide a category for the given CPT code. These headers were assigned using the D_CPT table (not helpful). **--> DROP**|object
SUBSECTIONHEADER|(not helpful) **--> DROP**|object
DESCRIPTION|When the COSTCENTER is "Resp", it describes how the ventilation was charged|object

In [None]:
df_cpt = pd.read_csv('data/CPTEVENTS.csv.gz', 
                                compression='gzip')


In [None]:
df_cpt.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,COSTCENTER,CHARTDATE,CPT_CD,CPT_NUMBER,CPT_SUFFIX,TICKET_ID_SEQ,SECTIONHEADER,SUBSECTIONHEADER,DESCRIPTION
0,317,11743,129545,ICU,,99232,99232.0,,6.0,Evaluation and management,Hospital inpatient services,
1,318,11743,129545,ICU,,99232,99232.0,,7.0,Evaluation and management,Hospital inpatient services,
2,319,11743,129545,ICU,,99232,99232.0,,8.0,Evaluation and management,Hospital inpatient services,
3,320,11743,129545,ICU,,99232,99232.0,,9.0,Evaluation and management,Hospital inpatient services,
4,321,6185,183725,ICU,,99223,99223.0,,1.0,Evaluation and management,Hospital inpatient services,


In [None]:
df_cpt.shape

(573146, 12)

In [None]:
df_cpt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 573146 entries, 0 to 573145
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   ROW_ID            573146 non-null  int64  
 1   SUBJECT_ID        573146 non-null  int64  
 2   HADM_ID           573146 non-null  int64  
 3   COSTCENTER        573146 non-null  object 
 4   CHARTDATE         101545 non-null  object 
 5   CPT_CD            573146 non-null  object 
 6   CPT_NUMBER        573128 non-null  float64
 7   CPT_SUFFIX        22 non-null      object 
 8   TICKET_ID_SEQ     471601 non-null  float64
 9   SECTIONHEADER     573125 non-null  object 
 10  SUBSECTIONHEADER  573125 non-null  object 
 11  DESCRIPTION       101545 non-null  object 
dtypes: float64(2), int64(3), object(7)
memory usage: 52.5+ MB


In [None]:
f'Missing from HADM_ID column: {round(df_cpt.HADM_ID.isnull().sum()/len(df_cpt)*100,2)}%'

'Missing from HADM_ID column: 0.0%'

In [None]:
df_cpt.DESCRIPTION.unique()

array([nan, 'VENT MGMT;SUBSQ DAYS(INVASIVE)',
       'VENT MGMT, 1ST DAY (INVASIVE)', 'VENT MGMT;SUBSQ DAYS(NIV)',
       'VENT MGMT,1ST DAY (NIV)'], dtype=object)
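
The four non-missing descriptions encode two things: first vs. subsequent day, and invasive vs. non-invasive (NIV) ventilation. A sketch of splitting them into boolean flags. The column names `INVASIVE` and `FIRST_DAY` are our own additions, not MIMIC-III columns:

```python
import pandas as pd

# The unique DESCRIPTION values from the output, plus one missing value
desc = pd.Series([
    "VENT MGMT;SUBSQ DAYS(INVASIVE)",
    "VENT MGMT, 1ST DAY (INVASIVE)",
    "VENT MGMT;SUBSQ DAYS(NIV)",
    "VENT MGMT,1ST DAY (NIV)",
    None,
])

flags = pd.DataFrame({
    "DESCRIPTION": desc,
    # regex=False: match the text literally; na=False: missing values become False
    "INVASIVE": desc.str.contains("(INVASIVE)", regex=False, na=False),
    "FIRST_DAY": desc.str.contains("1ST DAY", regex=False, na=False),
})
print(flags)
```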

#### Separating our patients

In [None]:
# Keep only the CPT rows that belong to admissions on our ARDS list
df_cpt_ARDS = df_cpt.loc[df_cpt['HADM_ID'].isin(liste_patienten)]

In [None]:
df_cpt_ARDS[df_cpt_ARDS.CPT_NUMBER == 94003].DESCRIPTION.unique()

array(['VENT MGMT;SUBSQ DAYS(INVASIVE)', 'VENT MGMT;SUBSQ DAYS(NIV)'],
      dtype=object)

In [None]:
df_cpt_ARDS[df_cpt_ARDS.CPT_NUMBER == 94002].DESCRIPTION.unique()

array([nan, 'VENT MGMT, 1ST DAY (INVASIVE)', 'VENT MGMT,1ST DAY (NIV)'],
      dtype=object)

In [None]:
f'{df_cpt_ARDS[df_cpt_ARDS.COSTCENTER == "Resp"].DESCRIPTION.isnull().sum()} NaN in DESCRIPTION column when COSTCENTER=="Resp"'

'0 NaN in DESCRIPTION column when COSTCENTER=="Resp"'

In [None]:
f'Missing in DESCRIPTION overall: {round(df_cpt_ARDS.DESCRIPTION.isnull().sum()/len(df_cpt_ARDS)*100,2)}%'

'Missing in DESCRIPTION overall: 75.75%'

In [None]:
df_cpt_ARDS.head(2)

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,COSTCENTER,CHARTDATE,CPT_CD,CPT_NUMBER,CPT_SUFFIX,TICKET_ID_SEQ,SECTIONHEADER,SUBSECTIONHEADER,DESCRIPTION
26,343,21910,171236,ICU,,99255,99255.0,,1.0,Evaluation and management,Consultations,
27,344,21910,171236,ICU,,99232,99232.0,,2.0,Evaluation and management,Hospital inpatient services,


##### Unique CPT codes billed by the respiratory cost center

In [None]:
# CPT_CD holds the same codes both as strings and as integers
df_cpt_ARDS[df_cpt_ARDS.COSTCENTER == 'Resp'].CPT_CD.unique()

array(['94003', '94002', 94003, 94002], dtype=object)
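
Mixed string and integer values make equality checks on `CPT_CD` unreliable, because `'94003' == 94003` is `False` in Python. A sketch of normalizing the column to strings, on a toy Series that mirrors the output above:

```python
import pandas as pd

# Same mix of types as in the unique() output
cpt_cd = pd.Series(["94003", "94002", 94003, 94002], dtype=object)

# Without normalization only the string entries match
n_raw = int((cpt_cd == "94003").sum())

# Cast everything to str so each code has a single representation
cpt_cd_str = cpt_cd.astype(str)
n_norm = int((cpt_cd_str == "94003").sum())
print(n_raw, n_norm, cpt_cd_str.unique())
```

Passing `dtype={'CPT_CD': str}` to `pd.read_csv` should avoid the mix at load time. Using `CPT_NUMBER` instead, as this notebook does, sidesteps the problem as well.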

In [None]:
df_cpt_ARDS[df_cpt_ARDS.COSTCENTER == 'Resp'].CPT_NUMBER.unique()

array([94003., 94002.])

In [None]:
f'{df_cpt_ARDS[df_cpt_ARDS.COSTCENTER == "Resp"].HADM_ID.nunique()} from our patient list that were charged for ventilation' 

'6052 from our patient list that were charged for ventilation'

##### Function to report how many admissions from our list are missing from a table

In [None]:
def missing(ARDS):
    """Print how many HADM_IDs from liste_patienten are missing from / present in ARDS."""
    total = len(liste_patienten)
    present = ARDS.HADM_ID.nunique()  # distinct admissions found in the table
    miss = total - present

    print('Patients report from our list:')
    print(f'Missing: {miss}')
    print(f'perc.: {round(miss/total*100,2)}%')

    print(f'Present: {present}')
    print(f'perc.: {round(present/total*100,2)}%')

    # Share of distinct admissions among all rows of the table
    print(f'percentage of unique number patients in the ARDS dataset: {round(present/len(ARDS.HADM_ID)*100,2)}%')

In [None]:
missing(df_cpt_ARDS)

Patients report from our list:
Missing: 366
perc.: 4.88%
Present: 7131
perc.: 95.12%
percentage of unique number patients in the ARDS dataset: 3.97%
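
`missing()` reports only counts. To see which admissions are absent, a set difference is enough. A sketch with invented IDs; `missing_ids` is a hypothetical helper and not part of this notebook:

```python
import pandas as pd

def missing_ids(expected_ids, table):
    """Return the sorted HADM_IDs from expected_ids that never occur in table."""
    return sorted(set(expected_ids) - set(table["HADM_ID"]))

# Toy data: three expected admissions, one of which has no rows in the table
toy_list = [100001, 100002, 100003]
toy_table = pd.DataFrame({"HADM_ID": [100001, 100001, 100003]})

absent = missing_ids(toy_list, toy_table)
print(absent)
```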


## DRGCODES

The DRGCODES table contains Diagnosis Related Groups (DRG), which the hospital uses for billing purposes. **We might not need this table in the end.**

Column|Description|Type
----|:-----|:-----
DRG_TYPE|DRG_TYPE provides the type of DRG code in the entry. There are three types of DRG codes in the MIMIC-III database, which have overlapping ranges but distinct definitions for the codes: ‘HCFA’ (Health Care Financing Administration), ‘MS’ (Medicare), and ‘APR’ (All Payers Registry). **--> DROP**|object
DRG_CODE|DRG_CODE contains a code which represents the diagnosis billed for by the hospital. Tracheostomy: 44, 54. Respiratory disease with ventilation support: 475, 576, 566, 565, 575, 208, 207|int64
DESCRIPTION|Categorizes the treatment so that it can be charged to the patient|object
DRG_SEVERITY|Severity and mortality allow for higher billing costs **--> DROP**|float
DRG_MORTALITY|**--> DROP**|float
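
Because the code ranges of the DRG types overlap, a bare `DRG_CODE` filter can mix up codes from different schemes. A sketch of filtering on (type, code) pairs before `DRG_TYPE` is dropped. The HCFA 475 and MS 208 pairs occur in the DRGCODES data shown in this section; the APR row and the `HADM_ID` values are invented to illustrate an overlap:

```python
import pandas as pd

# Toy rows: the APR row is invented to show a code that collides with HCFA 475
toy = pd.DataFrame({
    "HADM_ID": [1, 2, 3],
    "DRG_TYPE": ["HCFA", "MS", "APR"],
    "DRG_CODE": [475, 208, 475],
})

# (type, code) pairs treated as ventilation-related in this sketch
vent_pairs = pd.DataFrame({"DRG_TYPE": ["HCFA", "MS"], "DRG_CODE": [475, 208]})

# Inner merge keeps only rows whose type AND code both match a pair
vent_rows = toy.merge(vent_pairs, on=["DRG_TYPE", "DRG_CODE"], how="inner")
print(vent_rows)
```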

In [None]:
df_drgcodes = pd.read_csv('data/DRGCODES.csv.gz', 
                                compression='gzip')

In [None]:
df_drgcodes.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,DRG_TYPE,DRG_CODE,DESCRIPTION,DRG_SEVERITY,DRG_MORTALITY
0,342,2491,144486,HCFA,28,"TRAUMATIC STUPOR & COMA, COMA <1 HR AGE >17 WI...",,
1,343,24958,162910,HCFA,110,MAJOR CARDIOVASCULAR PROCEDURES WITH COMPLICAT...,,
2,344,18325,153751,HCFA,390,NEONATE WITH OTHER SIGNIFICANT PROBLEMS,,
3,345,17887,182692,HCFA,14,SPECIFIC CEREBROVASCULAR DISORDERS EXCEPT TRAN...,,
4,346,11113,157980,HCFA,390,NEONATE WITH OTHER SIGNIFICANT PROBLEMS,,


In [None]:
df_drgcodes.shape

(125557, 8)

In [None]:
df_drgcodes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 125557 entries, 0 to 125556
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   ROW_ID         125557 non-null  int64  
 1   SUBJECT_ID     125557 non-null  int64  
 2   HADM_ID        125557 non-null  int64  
 3   DRG_TYPE       125557 non-null  object 
 4   DRG_CODE       125557 non-null  int64  
 5   DESCRIPTION    125494 non-null  object 
 6   DRG_SEVERITY   66634 non-null   float64
 7   DRG_MORTALITY  66634 non-null   float64
dtypes: float64(2), int64(4), object(2)
memory usage: 7.7+ MB


In [None]:
df_drgcodes.HADM_ID.isnull().sum()

0

In [None]:
df_drgcodes.DESCRIPTION.nunique()

1367

In [None]:
# Drop rows without a description so that str.contains does not return NaN
df_drgcodes = df_drgcodes.dropna(subset=["DESCRIPTION"])
df_drgcodes[df_drgcodes['DESCRIPTION'].str.contains("VENTI")]

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,DRG_TYPE,DRG_CODE,DESCRIPTION,DRG_SEVERITY,DRG_MORTALITY
25,367,3286,133404,HCFA,475,RESPIRATORY SYSTEM DIAGNOSIS WITH VENTILATOR S...,,
173,82,12411,173718,HCFA,475,RESPIRATORY SYSTEM DIAGNOSIS WITH VENTILATOR S...,,
174,83,16053,189195,HCFA,475,RESPIRATORY SYSTEM DIAGNOSIS WITH VENTILATOR S...,,
201,110,26455,122530,HCFA,475,RESPIRATORY SYSTEM DIAGNOSIS WITH VENTILATOR S...,,
271,180,4454,177326,HCFA,475,RESPIRATORY SYSTEM DIAGNOSIS WITH VENTILATOR S...,,
...,...,...,...,...,...,...,...,...
125043,124026,30575,186455,MS,208,RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPP...,,
125082,125186,59924,199321,MS,208,RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPP...,,
125275,124563,50487,165833,MS,208,RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPP...,,
125320,125234,28827,190139,MS,208,RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPP...,,


Of these 3078 entries, only 1770 are from our patient list.
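
The cell that produced this count is not shown. A sketch of how it could be obtained by combining the text filter with `isin`, on toy data (descriptions taken from this section, IDs and list invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "HADM_ID": [100001, 100002, 100003, 100004],
    "DESCRIPTION": [
        "RESPIRATORY SYSTEM DIAGNOSIS WITH VENTILATOR SUPPORT",
        "SEPTICEMIA W MECHANICAL VENTILATOR 96+ HOURS AGE >17",
        "NEONATE WITH OTHER SIGNIFICANT PROBLEMS",
        None,
    ],
})
toy_list = [100001, 100004]  # stand-in for liste_patienten

# na=False makes missing descriptions count as "no match"
is_vent = toy.DESCRIPTION.str.contains("VENTI", na=False)

n_vent = int(is_vent.sum())
n_vent_on_list = int((is_vent & toy.HADM_ID.isin(toy_list)).sum())
print(n_vent, n_vent_on_list)
```

Note that this counts rows, not distinct admissions; an admission with several matching DRG entries is counted once per entry.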

In [None]:
df_drgcodes[df_drgcodes['DESCRIPTION'].str.contains("VENTI")].DESCRIPTION.unique()

array(['RESPIRATORY SYSTEM DIAGNOSIS WITH VENTILATOR SUPPORT',
       'TRACHEOSTOMY WITH MECHANICAL VENTILATION 96+ HOURS OR PRINCIPAL DIAGNOSIS EXCEPT FACE, MOUTH, AND NECK DIAGNOSES',
       'ECMO OR TRACHEOSTOMY WITH MECHANICAL VENTILATION 96+ HOURS OR PRINCIPAL DIAGNOSES EXCEPT FACE, MOUTH AND NECK DIAGNOSES WITH MAJOR OPERATING ROOM PROCEDURE',
       'TRACHEOSTOMY WITH MECHANICAL VENTILATION 96+ HOURS OR PRINCIPAL DIAGNOSIS EXCEPT FACE, MOUTH AND NECK DIAGNOSES WITHOUT MAJOR OPERATING ROOM PROCEDURE',
       'SEPTICEMIA W MECHANICAL VENTILATOR 96+ HOURS AGE >17',
       'SEPTICEMIA W MECHANICAL VENTILATOR W/0 96+ HOURS AGE >17',
       'RESPIRATORY SYSTEM DIAGNOSIS WITH VENTILATOR SUPPORT <96 HRS',
       'RESPIRATORY SYSTEM DIAGNOSIS WITH VENTILATOR SUPPORT 96+ HRS',
       'RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPPORT <96 HOURS',
       'RESPIRATORY SYSTEM DIAGNOSIS W VENTILATOR SUPPORT 96+ HOURS'],
      dtype=object)