<h1 style="text-align: center; font-size: 40px;">MIMIC-IV ED Data Cleaning and Exploration</h1>
<h2 style="text-align:center; color:#4F81BD;">1. Setup and Import Libraries</h2>

In this section, I import the required Python libraries and define project-relative paths, so the notebook runs on any machine without editing file paths.  
A clean, consistent environment also keeps the work reproducible and easy to follow for collaborators and our TA, Amitash!

In [1]:
import duckdb
import pandas as pd
import pathlib as pl

# Locate the project root (assumes the notebook is run from a subfolder of the repo)
# and build every data path from it, so anyone who clones the repo can run this as-is
ROOT = pl.Path.cwd().parent
DATA = ROOT / "data" / "MIMIC_ED"
RAW = DATA / "raw" / "mimicel.csv"
CLEAN = DATA / "cleaned" / "mimicel_clean.csv"

<h2 style="text-align:center; color:#4F81BD;">2. Load and Inspect Data</h2>

Here, I read the raw MIMIC-IV Emergency Department dataset through a DuckDB connection and convert query results to pandas DataFrames for exploration.  
The goal is to understand the dataset’s structure, including column names, data types, and potential quality issues before cleaning or analysis.  

Because the full MIMIC-IV ED file contains over 7.5 million rows, I load and inspect a **random sample (~200,000 rows)** for initial exploration.  
A random subset of this size should approximately preserve the distribution of key variables (arrival methods, acuity, dispositions) while allowing for faster computation and interactive data inspection on local hardware.  
All cleaning and validation steps are written to run unchanged on the full dataset later.
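
One way to draw such a subset is to sample `stay_id` values rather than individual rows, so that every sampled stay keeps all of its rows. The sketch below illustrates the idea on a small made-up DataFrame. The helper name `sample_stays`, the fraction, and the seed are assumptions for illustration, not the notebook's actual sampling code.

```python
import pandas as pd


def sample_stays(df: pd.DataFrame, frac: float, seed: int = 42) -> pd.DataFrame:
    """Keep all rows belonging to a random fraction of stays (hypothetical helper)."""
    stay_ids = df["stay_id"].drop_duplicates()
    # Sample stay identifiers, not rows, so no stay is split apart
    kept = stay_ids.sample(frac=frac, random_state=seed)
    return df[df["stay_id"].isin(kept)]


# Toy data: 20 stays with 3 rows each (made-up values)
toy = pd.DataFrame({
    "stay_id": [s for s in range(1, 21) for _ in range(3)],
    "seq_num": [1, 2, 3] * 20,
})

sample = sample_stays(toy, frac=0.10)
print(sample["stay_id"].nunique(), len(sample))
```

Fixing the seed keeps the subset identical across runs, which matters if later cells are compared between machines.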


In [3]:
# load and inspect
con = duckdb.connect()
con.execute(f"DESCRIBE SELECT * FROM read_csv_auto('{RAW}')").df()

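| | column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|---|
| 0 | stay_id | BIGINT | YES | | | |
| 1 | subject_id | BIGINT | YES | | | |
| 2 | hadm_id | BIGINT | YES | | | |
| 3 | timestamps | TIMESTAMP | YES | | | |
| 4 | activity | VARCHAR | YES | | | |
| 5 | gender | VARCHAR | YES | | | |
| 6 | race | VARCHAR | YES | | | |
| 7 | arrival_transport | VARCHAR | YES | | | |
| 8 | disposition | VARCHAR | YES | | | |
| 9 | seq_num | BIGINT | YES | | | |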


### Initial Observations

From the column summary above, we can see that the MIMIC-IV ED dataset contains both clinical and administrative variables.  
Key features include timestamps, vital signs (e.g., `temperature`, `heartrate`, `resprate`, `o2sat`, `sbp`, `dbp`), and patient- and encounter-level identifiers (`subject_id`, `stay_id`, `hadm_id`).  
Non-numeric fields such as `activity`, `disposition`, and `chiefcomplaint` describe the patient’s care process and outcomes.

Before analysis, the dataset will require cleaning to:
- Handle missing or inconsistent entries (e.g., null vitals or undefined dispositions)  
- Convert timestamp and numeric fields to the correct data types  
- Standardize categorical variables such as gender, race, and arrival transport  
- Remove columns that are not relevant to the operational metrics (wait time, LOS, arrival rate, disposition ratio)

These steps ensure data consistency and make the dataset suitable for metric extraction and calibration of the discrete-event simulation (DES) model in later notebooks.

<h2 style="text-align:center; color:#4F81BD;">3. Data Cleaning</h2>

This section focuses on preparing the data for analysis by correcting or removing inconsistent, missing, or invalid entries.  
Cleaning ensures that key variables such as timestamps, gender, arrival method, and vital signs are properly formatted and usable.  
Accurate data cleaning is essential to avoid bias or error in later analyses and simulation validation.
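
A cleaning pass of this kind could look roughly like the following. This is a minimal sketch on made-up rows, not the notebook's actual cleaning code. The function name `clean_events`, the upper-casing of categories, and the choice to drop rows with unparseable timestamps are all assumptions.

```python
import pandas as pd


def clean_events(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative cleaning pass (hypothetical helper)."""
    out = df.copy()
    # Parse timestamps; unparseable values become NaT instead of raising
    out["timestamps"] = pd.to_datetime(out["timestamps"], errors="coerce")
    # Normalize categorical text: trim whitespace and use one letter case
    for col in ["gender", "arrival_transport"]:
        out[col] = out[col].str.strip().str.upper().astype("category")
    # Coerce a vital sign to numeric; bad entries become NaN
    out["heartrate"] = pd.to_numeric(out["heartrate"], errors="coerce")
    # Assumption: a row without a usable timestamp is dropped
    return out.dropna(subset=["timestamps"]).reset_index(drop=True)


# Made-up rows with typical problems: bad timestamp, mixed case, stray spaces
raw = pd.DataFrame({
    "stay_id": [1, 1, 2],
    "timestamps": ["2150-01-01 10:00:00", "not a date", "2150-01-02 08:30:00"],
    "gender": [" f", "F", "m "],
    "arrival_transport": ["walk in", "WALK IN", "Ambulance"],
    "heartrate": ["88", "n/a", "102"],
})

cleaned = clean_events(raw)
print(cleaned.dtypes)
```

Coercing rather than raising keeps the pass from failing on a single bad entry, at the cost of having to count the resulting missing values afterwards.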

<h2 style="text-align:center; color:#4F81BD;">4. Feature and Metric Extraction</h2>

After cleaning, I compute important operational metrics from the ED dataset.  
The four metrics I use to validate the simulation are **average wait time**, **length of stay**, **arrival patterns**, and **disposition ratios**.  
Due to the de-identification process in MIMIC-IV, each patient's timestamps are shifted by a random offset for privacy protection, so events from different patients cannot be placed on a common calendar or reconstructed chronologically.  
As a result, these metrics were selected because they can be estimated from within-stay intervals and aggregate patterns, without requiring true calendar dates.  
By grounding the simulation in these empirical summaries, the DES model can reflect realistic patient flow dynamics while respecting the dataset’s privacy limitations.
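
As a rough illustration of two of these metrics, the sketch below derives length of stay and disposition ratios from a toy event log. It assumes that a stay's length can be approximated by the span between its first and last recorded event, and that `disposition` is constant within a stay. Both are simplifying assumptions, and the rows are made up.

```python
import pandas as pd

# Made-up event log: three stays with a few timestamped events each
events = pd.DataFrame({
    "stay_id": [1, 1, 1, 2, 2, 3, 3],
    "timestamps": pd.to_datetime([
        "2150-01-01 10:00", "2150-01-01 10:20", "2150-01-01 14:00",
        "2150-03-05 22:00", "2150-03-06 01:30",
        "2151-07-09 08:00", "2151-07-09 09:00",
    ]),
    "disposition": ["HOME", "HOME", "HOME", "ADMITTED", "ADMITTED", "HOME", "HOME"],
})

# Length of stay (assumption): last event minus first event per stay, in hours
per_stay = events.groupby("stay_id")["timestamps"].agg(["min", "max"])
los_hours = (per_stay["max"] - per_stay["min"]).dt.total_seconds() / 3600

# Disposition ratio: share of stays ending in each disposition
stay_disposition = events.groupby("stay_id")["disposition"].first()
disposition_ratio = stay_disposition.value_counts(normalize=True)

print(los_hours.round(2))
print(disposition_ratio.round(3))
```

Both quantities use only differences between timestamps of the same stay or counts across stays, so a per-patient date shift does not change them.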


<h2 style="text-align:center; color:#4F81BD;">5. Save Cleaned Dataset</h2>

In this step, I save the cleaned and processed dataset to the `data/MIMIC_ED/cleaned/` directory.  
This version is formatted for easy loading in later notebooks (e.g., parameter estimation and DES calibration).  
Saving a standardized, cleaned dataset supports reproducibility and allows others to rerun or build upon this work.  

Once the save cell below is uncommented, running this notebook will **overwrite the existing cleaned dataset** in `data/MIMIC_ED/cleaned/`.  
This ensures that results are always generated directly from the raw data and remain fully reproducible across machines.

In [2]:
# Save cleaned dataset safely (overwrite if already exists)
# df.to_csv(CLEAN, index=False)
# print(f"Cleaned dataset saved to: {CLEAN}")


<h2 style="text-align:center; color:#4F81BD;">6. Summary Statistics</h2>

Finally, I generate descriptive statistics and visual summaries of the cleaned dataset.  
These include patient volume, vital sign distributions, arrival patterns, and key timing metrics.  
This summary provides a clear baseline understanding of the ED system, which will be used to validate and calibrate the simulation in later phases.
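
A first pass at such a summary can be as simple as the following sketch, which uses made-up rows in place of the cleaned dataset. The column selection is an assumption for illustration.

```python
import pandas as pd

# Made-up stay-level rows standing in for the cleaned dataset
df = pd.DataFrame({
    "stay_id": [1, 2, 3, 4],
    "arrival_transport": ["WALK IN", "AMBULANCE", "WALK IN", "WALK IN"],
    "heartrate": [88.0, 102.0, None, 76.0],
})

# Patient volume: number of distinct stays
n_stays = df["stay_id"].nunique()

# Distribution of a vital sign (describe skips missing values)
heartrate_summary = df["heartrate"].describe()

# Share of stays by arrival method
arrival_share = df["arrival_transport"].value_counts(normalize=True)

print(n_stays)
print(heartrate_summary)
print(arrival_share)
```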
