# Development of machine learning models to process Electronic Health Records – Explainable Models

### Testing Notebook
Lok Hang Toby Lee (2431180L)

## Configuration Step
1. Imports
2. Set database configurations
3. Connect to MIMIC-III local postgreSQL database

In [15]:
# Imports:
import numpy as np
import pandas as pd
import sys
import matplotlib.pyplot as plt
from matplotlib import cm
import matplotlib.colors as mc
import colorsys
import psycopg2
import os
import yaml
%matplotlib inline

# Configuration:
sqluser = 'postgres'
dbname = 'mimic'
password='postgres'
schema_name = 'public, mimic, mimiciii;'

# Connect to MIMIC-III:
con = psycopg2.connect(dbname=dbname, user=sqluser, password=password)
cur = con.cursor()
cur.execute('SET search_path to ' + schema_name)

# Proposal 1: Train a machine learning model to predict Mortality rate given features from health record

Features (x):
1. ICU stay days
2. Reason of ICU stay (null if 0 day stays)
3. Age
4. Gender
4. Heart rate
5. Height

Label (y): Mortality of patient

Aims:
1. If the patient did not stay in ICU and have normal features, the mortality should be low. 
2. Find correlation from machine learning if the selected features are significant for mortality rate



## Extract Data for Machine Learning:

1. Extract the data for the features from the database
2. Group up the data into a single pandas array
3. split the pandas array into train validation test set

In [16]:
# Imports:
import psycopg2
import numpy as np
import pandas as pd
import os
import yaml

In [17]:
# Settings for the query:
min_age = 15
limit_population = 0 # if we want to run the query for a small number of patients (for debugging)
if limit_population > 0:
    limit = 'LIMIT ' + str(limit_population)
else:
    limit = ''

In [20]:
query = """
with patient_and_icustay_details as (
    SELECT distinct
        p.gender, p.dob, p.dod, s.*, a.admittime, a.dischtime, a.deathtime, a.ethnicity, a.diagnosis,
        DENSE_RANK() OVER (PARTITION BY a.subject_id ORDER BY a.admittime) AS hospstay_seq,
        DENSE_RANK() OVER (PARTITION BY s.hadm_id ORDER BY s.intime) AS icustay_seq,
        DATE_PART('year', s.intime) - DATE_PART('year', p.dob) as admission_age,
        DATE_PART('day', s.outtime - s.intime) as los_icu
    FROM patients p 
        INNER JOIN icustays s ON p.subject_id = s.subject_id
        INNER JOIN admissions a ON s.hadm_id = a.hadm_id 
    WHERE s.first_careunit NOT like 'NICU'
        and s.hadm_id is not null and s.icustay_id is not null
        and (s.outtime >= (s.intime + interval '12 hours'))
        and (s.outtime <= (s.intime + interval '240 hours'))
    ORDER BY s.subject_id 
)
SELECT * 
FROM patient_and_icustay_details 
WHERE hospstay_seq = 1
    and icustay_seq = 1
    and admission_age >=  """ + str(min_age) + """
    and los_icu >= 0.5
""" + str(limit)
patients_data = pd.read_sql_query('SET search_path to ' + schema_name + query, con)

# Save result:
patients_data.to_csv('static_data.csv')

In [21]:
patients_data

Unnamed: 0,gender,dob,dod,row_id,subject_id,hadm_id,icustay_id,dbsource,first_careunit,last_careunit,...,los,admittime,dischtime,deathtime,ethnicity,diagnosis,hospstay_seq,icustay_seq,admission_age,los_icu
0,M,2025-04-11,2102-06-14,2,3,145834,211552,carevue,MICU,MICU,...,6.0646,2101-10-20 19:08:00,2101-10-31 13:58:00,NaT,WHITE,HYPOTENSION,1,1,76.0,6.0
1,F,2143-05-12,NaT,3,4,185777,294638,carevue,MICU,MICU,...,1.6785,2191-03-16 00:28:00,2191-03-23 18:41:00,NaT,WHITE,"FEVER,DEHYDRATION,FAILURE TO THRIVE",1,1,48.0,1.0
2,F,2109-06-21,NaT,5,6,107064,228232,carevue,SICU,SICU,...,3.6729,2175-05-30 07:15:00,2175-06-15 16:00:00,NaT,WHITE,CHRONIC RENAL FAILURE/SDA,1,1,66.0,3.0
3,M,2108-01-26,2149-11-14,9,9,150750,220597,carevue,MICU,MICU,...,5.3231,2149-11-09 13:06:00,2149-11-14 10:15:00,2149-11-14 10:15:00,UNKNOWN/NOT SPECIFIED,HEMORRHAGIC CVA,1,1,41.0,5.0
4,F,2128-02-22,2178-11-14,11,11,194540,229441,carevue,SICU,SICU,...,1.5844,2178-04-16 06:18:00,2178-05-11 19:00:00,NaT,WHITE,BRAIN MASS,1,1,50.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
30058,M,2114-09-29,NaT,61527,99983,117390,286606,metavision,CCU,CCU,...,1.0399,2193-04-26 11:35:00,2193-04-29 13:30:00,NaT,UNKNOWN/NOT SPECIFIED,ST ELEVATION MYOCARDIAL INFARCTION;CORONARY AR...,1,1,79.0,1.0
30059,M,2137-04-07,NaT,61529,99991,151118,226241,metavision,TSICU,TSICU,...,3.1426,2184-12-24 08:30:00,2185-01-05 12:15:00,NaT,WHITE,DIVERTICULITIS/SDA,1,1,47.0,3.0
30060,F,2078-10-17,NaT,61530,99992,197084,242052,metavision,MICU,MICU,...,1.9745,2144-07-25 18:03:00,2144-07-28 17:56:00,NaT,WHITE,RETROPERITONEAL HEMORRHAGE,1,1,66.0,1.0
30061,F,2058-05-29,2147-09-29,61531,99995,137810,229633,metavision,CSRU,CSRU,...,2.1615,2147-02-08 08:00:00,2147-02-11 13:15:00,NaT,WHITE,ABDOMINAL AORTIC ANEURYSM/SDA,1,1,89.0,2.0


In [23]:
variables_to_keep = ('Capillary refill rate', 'Diastolic blood pressure', 'Fraction inspired oxygen', 
                     'Glascow coma scale eye opening', 'Glascow coma scale motor response', 'Glascow coma scale total',
                     'Glascow coma scale verbal response', 'Glucose', 'Heart Rate', 'Height', 'Mean blood pressure',
                     'Oxygen saturation', 'Respiratory rate', 'Systolic blood pressure', 'Temperature', 'Weight', 'pH')
var_map = pd.read_csv('static_data.csv')

In [24]:
icu_ids_to_keep = patients_data['icustay_id']
icu_ids_to_keep = tuple(set([str(i) for i in icu_ids_to_keep]))
subjects_to_keep = patients_data['subject_id']
subjects_to_keep = tuple(set([str(i) for i in subjects_to_keep]))
hadms_to_keep = patients_data['hadm_id']
hadms_to_keep = tuple(set([str(i) for i in hadms_to_keep]))

labitems_to_keep = []
chartitems_to_keep = []
for i in range(var_map.shape[0]):
    if var_map['LEVEL2'][i] in variables_to_keep:
        if var_map['LINKSTO'][i] == 'chartevents':
            chartitems_to_keep.append(var_map['ITEMID'][i])
        elif var_map['LINKSTO'][i] == 'labevents':
            labitems_to_keep.append(var_map['ITEMID'][i])
            
all_to_keep = chartitems_to_keep + labitems_to_keep
var_map = var_map[var_map.ITEMID.isin(all_to_keep)]
chartitems_to_keep = tuple(set([str(i) for i in chartitems_to_keep]))
labitems_to_keep = tuple(set([str(i) for i in labitems_to_keep]))

KeyError: 'LEVEL2'

## Training the model
1. Train the model using different methods (Linear regression)
2. Train and combine the features
3. Note the effectiveness of each feature (by analysing covarinace matrix)
4. Compare the performance of models by f1 score
5. Principle component analysis to visualise the results
6. Select the final model 